PDF image generator


adrian

I have put together a very basic module for generating images from the pages of uploaded PDFs.

It requires ImageMagick, Ghostscript, and the imagick PECL extension. It could easily be adapted to work without the imagick extension by calling ImageMagick through exec, but I usually like having exec disabled.

At the moment it also requires a couple of specific custom fields: document_pdf (file) and document_thumb (image), both with "Maximum files allowed" set to "1". Obviously I will make these more generic, or add the fields on module install once this is closer to being released.

There is some commented-out code that resizes the image before it is added to PW, if that is what is wanted. I should also make the resize dimensions configurable module options.
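As a rough illustration of the resizing arithmetic (a sketch, not the module's code): PDF dimensions are measured in points at 72 per inch, so the rendered pixel size is points * resolution / 72. The helper names `renderedPixels` and `fitToMaxWidth`, and the idea of a maximum-width setting, are assumptions made up for this example.

```php
<?php

/**
 * Pixel size of a PDF dimension rendered at a given resolution.
 * PDF points are 1/72 inch, so pixels = points * dpi / 72.
 */
function renderedPixels(float $points, int $dpi): int {
    return (int) ceil($points * $dpi / 72);
}

/**
 * Scale a width/height pair down to a maximum width, keeping the aspect ratio.
 * Images that are already narrow enough are returned unchanged.
 * $maxWidth stands in for a hypothetical module config setting.
 */
function fitToMaxWidth(int $width, int $height, int $maxWidth): array {
    if ($width <= $maxWidth) {
        return array($width, $height);
    }
    $scale = $maxWidth / $width;
    return array($maxWidth, (int) round($height * $scale));
}

// US Letter is 612 x 792 points (8.5 x 11 inches), rendered here at 288 dpi.
$w = renderedPixels(612, 288);
$h = renderedPixels(792, 288);
list($thumbW, $thumbH) = fitToMaxWidth($w, $h, 150);
```

The commented-out code in the module gets the same effect by passing 0 as the height to scaleImage, which leaves the aspect ratio to Imagick.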

Currently I am using this for a searchable list of publications to generate a thumbnail of the cover to place with the description, PDF download link etc.

I am planning on extending this a fair bit as I want to use it to generate image previews for each page of the PDF so I can use them in a "look inside" modal lightbox. These would all be stored in a standard images field.

I think when I start generating images for all the pages I am going to run up against a speed problem. It could potentially take a few minutes to generate all the images for a large PDF. I was thinking of making use of lazycron, but I think I'd rather see some progress indicator. The other issue is if I use lazycron, then it is possible that someone may visit the site and go to use the look inside functionality before the images for all the pages are ready, so maybe this idea is out. Perhaps the best approach would be to hook into the PDF upload. Once the upload is complete, the module would start generating the images before the page is ever saved by the user. Would really appreciate any suggestions on how to set this up.

Slightly off topic, but one other minor consideration in all this is RGB vs CMYK colorspace. RGB PDFs convert fine, but the colors from CMYK PDFs are often terrible. There is a tweak that can be made to ImageMagick to fix it. Here is a good description of the problem and how to fix it: http://www.lassosoft.com/CMYK-Colour-Matching-with-ImageMagick

 
Basically you need to add the command line option -dUseCIEColor to all of the GhostScript commands in the delegates.xml file, so that they now look like this:
<delegate decode="eps" encode="ps" mode="bi" command='"gs" -q -dBATCH
      -dSAFER -dUseCIEColor -dMaxBitmap=500000000 -dNOPAUSE -dAlignToPixels=0
      -sDEVICE="pswrite" -sOutputFile="%o" -f"%i"' />
 

Would anyone else make use of this module?

Also, does anyone have any obvious suggestions for what I have so far? It's my first module, and I don't really have a handle on best practices yet.

Thanks!

PS I know convention is to attach module files, but this one is so short at the moment, it didn't seem worthwhile.

<?php

/**
 * ProcessWire ProcessPDFImageCreator
 *
 * Process PDF Image Creator creates images from PDFs.
 *
 * @copyright Copyright (c) 2013, Adrian Jones
 *
 */

class ProcessPDFImageCreator extends WireData implements Module {

    /**
     * Return information about this module (required)
     *
     * @return array
     *
     */
    static public function getModuleInfo() {
        return array(
            'title'    => 'PDF Image Creator',
            'summary'  => 'Creates images from PDFs.',
            'version'  => 001,
            'author'   => 'Adrian Jones',
            'singular' => true,
            'autoload' => true
        );
    }


    /**
     * Initialize the module and setup hook
     */
    public function init() {
        $this->pages->addHookAfter('save', $this, 'createPdfImage');
    }

    /**
     * If document_pdf field contains a PDF, generate an image from the first page.
     *
     *
     */
    public function createPdfImage($event) {

        $page = $event->arguments[0];

        if(count($page->document_pdf)>0){

            // Only generate a thumbnail if one does not exist yet. This also stops the
            // $page->save() call below from re-triggering the conversion.
            if(count($page->document_thumb)==0){

                $pdf_filepath = $page->document_pdf->first()->filename . '[0]'; //the appended [0] refers to the first page of the PDF
                $jpg_filepath = str_replace('.pdf', '.jpg', $page->document_pdf->first()->filename);

                $resolution = 288;

                $im = new imagick();
                $im->setOption("pdf:use-cropbox","true");
                $im->setColorspace(Imagick::COLORSPACE_RGB);
                $im->setResolution($resolution,$resolution);
                $im->readImage($pdf_filepath);

                $geometry=$im->getImageGeometry();

                $width = ceil($geometry['width'] / ($resolution/72));
                $height = ceil($geometry['height'] / ($resolution/72));

                $im->setImageFormat("jpg");

                /*if($width>150){
                    $width = 150;
                    $height = 0;
                }
                $im->scaleImage($width, $height);*/

                $im->writeImage($jpg_filepath);

                $page->of(false);
                $page->document_thumb->add($jpg_filepath);
                $page->document_thumb->first()->description = $page->title . ' PDF thumbnail';
                $page->save();
            }

        }

    }

}

You could hook on save page instead. Just check if images already exist for the existing files, and if not, trigger the process for those that are missing.


Hey diogo - thanks for your thoughts. So do you mean on save, rather than what I have, which is after save? I changed it from addHookAfter to just addHook and it still seems to work fine. I already have a check to see if the document_thumb image field is populated. Anyway, I'll be sure to check for missing images from the full collection of page images (in case something crashes, or the user kills the process), and just generate the missing ones on the next page save - good idea - thanks!

Once all the page images have been generated, subsequent saves of the page for other edits should be quick since it won't regenerate the images, although I should probably build in a check to see if the PDF has been changed from the original uploaded version so that new images are generated without the user needing to delete the old ones to trigger the function again. Any ideas how to check if the PDF has been replaced during the current edit?
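One possible way to detect a replaced PDF, independent of ProcessWire's change tracking, is to compare a hash of the current file with a hash recorded when the images were last generated. This is only a sketch under assumptions: `pdfHasChanged` is a made-up helper, and the stored hash would have to live somewhere such as a hidden text field on the page, which the module does not currently have.

```php
<?php

/**
 * Return true if the PDF at $pdfPath differs from the version whose hash
 * was recorded earlier. $storedHash is hypothetical: it would need to be
 * saved (for example in a hidden text field) whenever images are generated.
 */
function pdfHasChanged(string $pdfPath, ?string $storedHash): bool {
    if ($storedHash === null || $storedHash === '') {
        return true; // nothing recorded yet, so treat the file as new
    }
    $currentHash = @md5_file($pdfPath);
    if ($currentHash === false) {
        return true; // unreadable file: regenerate rather than guess
    }
    return !hash_equals($storedHash, $currentHash);
}
```

Hashing reads the whole file, so for very large PDFs comparing file size and modification time first might be a cheaper pre-check.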

Maybe the key thing is to add some JS loader when saving to ask them to be patient. Anyone have any better ideas?


Ok, here is v2. It now uses a standard multi images field to store the images, and creates an image for each page of the PDF. I figure that way the template can just grab the first one where it needs to generate a cover thumbnail. Not relevant to the module, but I added a document_thumb_override field that allows a user to upload an image to use in place of the first page image if they'd prefer. Obviously the template checks to see if this is available before using the first page image.

I have several comments throughout the code on things I will add shortly:

  • Ability to see what images are missing (if any) and just generate those.
  • Check if the PDF was updated (so need to generate images again).
  • Add some loading spinner as this definitely takes a while
  • Ability to set the required or maximum image dimensions - for a standard letter sized doc, the images are currently generated at 2448px x 3168px. The commented out scaleImage code needs to be activated, but tied to module config settings.
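The first item on that list could be sketched as follows. This is an illustration rather than the module's code: `missingPageNumbers` is a hypothetical helper, and it assumes the "name-pagenumber.jpg" naming used by createPdfImage() and that the file names are unchanged once they are added to the images field.

```php
<?php

/**
 * Given the PDF's file name, its page count, and the basenames of the images
 * already stored, return the zero-based page numbers that still need an image.
 * Assumes images are named "<pdf name without extension>-<page>.jpg".
 */
function missingPageNumbers(string $pdfBasename, int $numPages, array $existingImages): array {
    $stem = preg_replace('/\.pdf$/i', '', $pdfBasename);
    $missing = array();
    for ($pn = 0; $pn < $numPages; $pn++) {
        if (!in_array($stem . '-' . $pn . '.jpg', $existingImages, true)) {
            $missing[] = $pn;
        }
    }
    return $missing;
}

// Example: a four-page PDF where only pages 0 and 2 were generated.
$todo = missingPageNumbers('report.pdf', 4, array('report-0.jpg', 'report-2.jpg'));
```

The save hook could then loop over the returned page numbers instead of skipping generation whenever the images field is non-empty.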

Also, should I have the module create any required fields, or just provide an explanation of how to adapt it to a user's particular scenario / field structure?

Thanks for any comments/suggestions.

<?php

/**
 * ProcessWire ProcessPDFImageCreator
 *
 * Process PDF Image Creator creates images from PDFs.
 *
 * @copyright Copyright (c) 2013, Adrian Jones
 *
 */

class ProcessPDFImageCreator extends WireData implements Module {

    /**
     * Return information about this module (required)
     *
     * @return array
     *
     */
    static public function getModuleInfo() {
        return array(
            'title'    => 'PDF Image Creator',
            'summary'  => 'Creates images from PDFs.',
            'version'  => 002,
            'author'   => 'Adrian Jones',
            'singular' => true,
            'autoload' => true
        );
    }


    /**
     * Initialize the module and setup hook
     */
    public function init() {
        $this->pages->addHook('save', $this, 'createPdfImages');
    }

    public function getNumPagesPdf($filepath){

        //The FPDI method is the best combination of accuracy and efficiency, but introduces yet another dependency. There are also issues with the free version of FPDI and support for PDF versions above 1.4
        //If you have the paid version, I think this is the best option. It is available here: http://www.setasign.de/products/pdf-php-solutions/fpdi/

        /*require_once('/tcpdf/tcpdf.php');
        require_once('/fpdi/fpdi.php');

        $pdf =& new FPDI();

        $pagecount = $pdf->setSourceFile($filepath);
        return $pagecount;*/

        //If you have exec available, you can also try this option: http://stackoverflow.com/questions/14644353/finally-found-a-fast-easy-and-accurate-way-to-get-the-number-of-pages-in-a-pdf/14644354#14644354

        //This first option is tried because the imagick method is quite slow. The imagick option is only used if this one fails. Unfortunately, it often fails!
        $fp = @fopen(preg_replace("/\[(.*?)\]/i", "",$filepath),"r");
        $max=0;
        //Skip the scan if the file could not be opened, otherwise feof() would never return true
        if($fp !== false) {
            while(!feof($fp)) {
                $line = fgets($fp,255);
                if (preg_match('/\/Count [0-9]+/', $line, $matches)){
                    preg_match('/[0-9]+/',$matches[0], $matches2);
                    if ($max<$matches2[0]) $max=(int)$matches2[0];
                }
            }
            fclose($fp);
        }

        //If above failed ($max==0), then resort to imagick
        if($max==0){
            $im = new imagick($filepath);
            $max=$im->getNumberImages();
        }

        return $max;
    }


    /**
     * If document_pdf field contains a PDF, generate images for each page, stored in a standard multi images field.
     * Should maybe switch document_pdf field to a standard files field, or maybe get the module to install the field. Should the module check availability of required fields in general when installing?
     *
     */
    public function createPdfImages($event){
        $page = $event->arguments[0];

        if(count($page->document_pdf)>0){

            //Need to modify check with an OR to determine if PDF file was just updated from past uploaded version and also see if only some of the images exist and only generate the missing ones
            //Also need to store number of pages in document_pdf_num_pages field so don't have to run that check every time, unless the PDF changed.
            if(count($page->images)==0){
                $numPages = $this->getNumPagesPdf($page->document_pdf->first()->filename);

                for ($pn=0; $pn<$numPages; $pn++){
                    $this->createPdfImage($page, 'images', $pn);
                }
            }

        }
    }

    /**
     * Generate images.
     *
     */
    public function createPdfImage($page, $image_field, $pn) {

        $pdf_filepath = $page->document_pdf->first()->filename . '['.$pn.']';
        $jpg_filepath = str_replace('.pdf', '-'.$pn.'.jpg', $page->document_pdf->first()->filename);

        $resolution = 288; // Can't remember where I got 288 from - I think mostly trial and error many years ago, but seems to give best results

        $im = new imagick();
        $im->setOption("pdf:use-cropbox","true");
        $im->setColorspace(Imagick::COLORSPACE_RGB);
        $im->setResolution($resolution,$resolution);
        $im->readImage($pdf_filepath);

        $geometry=$im->getImageGeometry();

        $width = ceil($geometry['width'] / ($resolution/72));
        $height = ceil($geometry['height'] / ($resolution/72));

        $im->setImageFormat("jpg");

        /*if($width>150){
            $width = 150;
            $height = 0;
        }
        $im->scaleImage($width, $height);*/

        $im->writeImage($jpg_filepath);

        $page->of(false);
        $page->$image_field->add($jpg_filepath);
        $page->$image_field->last()->description = $page->title . ' Page ' . ($pn+1);
        $page->save();
    }

}

Very cool Adrian! A few suggestions:

  • Make your version number 2 rather than 002. PHP might interpret that as Octal or something. 
  • Add an ___install() method that does a class_exists('imagick') and throws a WireException if it doesn't. 
  • Make the ___install() method add the document_pdf field, or make the module configurable so people can tell it what field to use. 
  • Put it on GitHub.
  • Add to the modules directory.
  • Profit! (as Adam would say)
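The class_exists check from the second suggestion might look roughly like this. It is a minimal sketch, not the module's code: `findMissingClasses` is a hypothetical helper, and inside a real ___install() method the failure would be thrown as a WireException rather than collected in a message.

```php
<?php

/**
 * Return the names from $required that are not available as classes.
 * An empty array means every dependency was found.
 */
function findMissingClasses(array $required): array {
    $missing = array();
    foreach ($required as $className) {
        if (!class_exists($className)) {
            $missing[] = $className;
        }
    }
    return $missing;
}

// The module only needs the imagick extension's class.
$missing = findMissingClasses(array('Imagick'));
$message = count($missing) > 0
    ? 'Missing required PHP classes: ' . implode(', ', $missing)
    : '';
// In ___install(): if ($message !== '') throw new WireException($message);
```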

Thanks for all the great suggestions Ryan. I will definitely add those dependency checks and implement some config options, including the field, max image dimensions etc and make available on github / modules directory.

I didn't really have any time to work on this yesterday, but I took a quick look at using $page->isChanged("field") in an attempt to determine whether the user has uploaded a new version of the PDF during the current page editing session, so it knows to regenerate the images. It didn't seem to work as I expected. Looking at the docs and cheatsheet, it seems like I don't need to turn on change tracking in this scenario, but maybe I am not properly understanding how this works. This seems like it should be the most elegant way to check this.

The other thing I need to do is ensure Ghostscript is installed. I don't think there is a PHP way to do this without exec or system, so I am wondering what you think about having the module include a very small PDF file that the module install method can do a test readImage on. Would that be too hackish? I don't want to blemish PW :)
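The test-read idea could be sketched like this. It is an assumption-laden illustration: `canRenderPdf` is a made-up helper, and the path to the bundled test PDF is whatever the module would ship with. A failed read typically surfaces as an exception from readImage, which is what the check relies on.

```php
<?php

/**
 * Try to render the first page of a known-good PDF.
 * Returns false if imagick is missing, the file is missing, or the read
 * fails (as would be expected when the PDF cannot be rasterized).
 */
function canRenderPdf(string $testPdfPath): bool {
    if (!class_exists('Imagick') || !is_file($testPdfPath)) {
        return false;
    }
    try {
        $im = new Imagick();
        $im->readImage($testPdfPath . '[0]'); // [0] = first page only
        $ok = $im->getNumberImages() > 0;
        $im->clear();
        return $ok;
    } catch (Exception $e) {
        return false;
    }
}
```

An install method could call this with the bundled file and refuse to install (or just warn) when it returns false, without needing exec or system.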


  • 4 months later...

This is still not really finished. It definitely works and has been tested on a couple of different sites for several months now, but I think it still needs some refining before being ready for prime time. 

It is now on GitHub: https://github.com/adrianbj/ProcessPDFImageCreator

Will wait for module directory submission until I get time to make some refinements though.


  • 1 year later...

Adrian

Should this work for displaying the cover of a PDF in ListerPro?

I'm using ListerPro to list a virtual directory of PDFs. Essentially each PDF is a page with a Files field. I drop my PDF in there and then use ListerPro to display the following

[Attached screenshot]

I'll try it tomorrow but thought I'd check first. Unfortunately, I already have my File and Images fields named so that may affect things?


Hi Peter,

This was the first module I ever put together and I still haven't really finalized it - I have it working on a few sites, but it really needs some configuration options - what images field to use, what pages of the PDF to create images for, etc.

You might be better off with this module by Richard Jedlička: https://processwire.com/talk/topic/6426-pdf-fieldtypeinputfield/ although since it is a separate fieldtype you might have a little trouble migrating your content from your existing files field.

If that is an issue, perhaps just tweak the code in this module of mine to suit your needs, because it sounds like you don't want images for every page, just the cover.

Let me know if there is anything I can do to help.


  • 4 years later...

I just have to leave a few words here. Sadly, I haven't had much work with PW over the last months, but now I needed PDF thumbs... and there are two modules... the first one doesn't work.

This one runs for me now on a shared host with PHP 7.2 and PW 3.101, and I have to say thank you again to Adrian!

Even your small, unpolished modules work like a charm over the years!

Best regards from an (at the moment) spare-time PW user... (two daughters have taken over my time almost completely - best wishes to @Pete)

mr-fan

