
SearchEngine PDF Indexer add-on


teppo

This module has been superseded by SearchEngineFileIndexer and there will be no further development for it.

---

This module is an optional — and experimental — add-on for SearchEngine. It adds support for indexing PDF file contents.

While SearchEngine is technically able to index file fields and contained Pagefile(s), it will natively only store the name and description of each file (and hopefully soon custom field values as well). This module hooks into Pagefile indexing and, if said Pagefile looks like a valid PDF document, attempts to extract human-readable text from the file itself.
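
As a rough illustration of what "looks like a valid PDF document" could mean, here is a minimal sketch that checks the file signature. This is an assumption for illustration, not necessarily how the module decides; `looksLikePdf()` is a hypothetical helper name, and a matching header does not guarantee that the file can actually be parsed.

```php
<?php
// Hypothetical helper, not taken from the module's source.
// Returns true when the file starts with the "%PDF-" signature.
function looksLikePdf(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    // PDF files begin with "%PDF-" followed by a version, e.g. "%PDF-1.7".
    $header = fread($handle, 5);
    fclose($handle);
    return $header === '%PDF-';
}
```

A check like this is cheap enough to run on every file before handing it to a parser library, which is where the real work (and most of the failure cases) would be.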

Getting started is straightforward: install and configure SearchEngine, install SearchEngine PDF Indexer, and choose which PDF parser library you'd like to use. The rest should happen automagically behind the scenes.

---

Now, as you may or may not know, PDF files are notoriously difficult to process programmatically. For this reason a) we're going to rely on third party libraries to handle parsing them, and b) things can still go wrong, so please consider this module an early beta release. It did work in my early tests, but there's little guarantee that it will work in real life use cases, and as such I'd recommend backing up your site before installing/enabling this module 🙂

Also: while this module can be installed via the admin or by cloning/downloading the module from the GitHub repository, please note that you need to run composer install in the module's directory, or preferably install the whole module via Composer. This is mainly because I really don't like bundling dependencies with the module, especially when there's a bunch of them.

(... although if you dislike Composer or for whatever reason can't use it, feel free to load either smalot/pdfparser or spatie/pdf-to-text manually. Just make sure that they're available by the time the module's class file is constructed.)
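
If you do load the libraries manually, a quick way to confirm that one of them is actually available is to test for its main class. The sketch below is only an illustration under that assumption: `extractPdfText()` is a hypothetical function name and not part of the module, while the two library calls are the documented entry points of smalot/pdfparser and spatie/pdf-to-text.

```php
<?php
// Hypothetical helper: tries whichever parser library is loadable and
// returns the extracted text, or null if neither is available or parsing fails.
function extractPdfText(string $path): ?string
{
    try {
        if (class_exists(\Smalot\PdfParser\Parser::class)) {
            // Pure PHP parser, no external binaries needed.
            $parser = new \Smalot\PdfParser\Parser();
            return $parser->parseFile($path)->getText();
        }
        if (class_exists(\Spatie\PdfToText\Pdf::class)) {
            // Wrapper around the pdftotext binary, which must exist on the host.
            return \Spatie\PdfToText\Pdf::getText($path);
        }
    } catch (\Throwable $e) {
        // PDFs fail to parse for many reasons; treat any error as "no text".
        return null;
    }
    return null;
}
```

If this returns null for a file you know is a readable PDF, the first thing to check is whether either class can be autoloaded at all.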

---

If you get a chance to use this module, please let me know how it went 🙂


Hey @teppo thx for sharing!

13 hours ago, teppo said:

This is mainly because I really don't like bundling dependencies with the module, especially when there's a bunch of them.

A bit offtopic sorry but do you have any good reasons for that? I recently took the opposite route and added some dependencies to one of my modules because I did not like the extra step of composer for making the module work. The dependencies were small in my opinion, only around 200 or 300kB but others might judge differently. I'd be happy to hear your thoughts about that topic!


11 hours ago, bernhard said:

A bit offtopic sorry but do you have any good reasons for that? I recently took the opposite route and added some dependencies to one of my modules because I did not like the extra step of composer for making the module work. The dependencies were small in my opinion, only around 200 or 300kB but others might judge differently. I'd be happy to hear your thoughts about that topic!

No worries, this is always an interesting topic to discuss, as off-topic as it may be. My answer will be a bit lengthy, so I'll wrap it in a spoiler tag (feels overkill to split this into a new topic) 🙂

Spoiler

The truth is that this is at least partially about personal preference, i.e. I like to keep third party dependencies as loosely coupled with my own code as possible and feel that Composer is the "modern way" to handle PHP dependency management. There are some technical reasons behind this as well, though.

Here are a few of my "less opinionated" reasons, in no particular order:

  1. Managing dependency updates is super easy with Composer: literally a single command is required. And if my dependencies have dependencies of their own, those are also automatically taken care of. At the same time I can lock dependency versions to a minimum/maximum/exact major, minor, or patch version, if need be.
  2. Managing requirements is equally straightforward, e.g. in case one of my dependencies decides to drop PHP 7 support or depends on a specific application being installed at the OS level that isn't available on the current host. Of course this only works as long as third party dependencies specify their requirements properly, but if they don't, that's a huge red flag anyway.
  3. Since third party code is not bundled with my own stuff, in case I want to run code quality analyzers etc. things tend to be more straightforward and more efficient. At the very least I don't need to specifically instruct the tool to omit this and that.
  4. ... and loosely related to the previous point: while code analyzers work nicely on my own code (per-project "custom" code), at the same time I can rely on tools like roave/security-advisories to make sure that my dependencies don't have vulnerabilities. Again doable without Composer, but easier with it, and one less thing to keep track of (and potentially forget).
  5. Finally, using dependencies in my own code is dead simple with Composer autoload. No need to manually include specific files; declare the dependency, include the Composer autoload file (usually once per project), and be done with it.
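
To make the autoload point a bit more concrete, here is a small sketch. It assumes Composer 2 (which ships the `Composer\InstalledVersions` class) and the default vendor/ directory next to the script; `installedVersion()` is a hypothetical helper name used only for illustration.

```php
<?php
// Assumed location: Composer's default vendor/ directory next to this file.
$autoload = __DIR__ . '/vendor/autoload.php';
if (is_file($autoload)) {
    // Usually included once per project; after this, classes load on demand.
    require_once $autoload;
}

// Hypothetical helper: reports the installed version of a package,
// or null when Composer 2 metadata or the package itself is missing.
function installedVersion(string $package): ?string
{
    if (!class_exists(\Composer\InstalledVersions::class)) {
        return null;
    }
    if (!\Composer\InstalledVersions::isInstalled($package)) {
        return null;
    }
    return \Composer\InstalledVersions::getPrettyVersion($package);
}
```

Calling something like `installedVersion('smalot/pdfparser')` would then tell you which version Composer resolved, without a single manual include for the library itself.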

In my case one additional benefit is that in case there are shared dependencies between modules or libraries, I only need one of each. And, at the same time, at the very least I'm clearly alerted in case modules/libraries have incompatible dependencies — e.g. one module requiring Guzzle 5 and another Guzzle 7. If both just manually loaded their versions, that could cause major (and potentially randomly manifesting and thus difficult to debug) issues.

It's probably obvious by now, but I tend to install as many dependencies as possible via Composer. If a PHP library doesn't provide Composer support, it's very unlikely that I would use it at all, and if a module isn't installable via Composer, I either install it directly via the repository, or alternatively create a private fork and use that instead. This way the benefits mentioned above accumulate: the more dependencies I have, the more benefit I get from using Composer to manage them.

Disk space is rarely a real concern, in my opinion. A few hundred KB here or there won't make much of a difference on modern hardware. It's everything else that matters more 🙂

... and all that being said, I have also opted to bundle dependencies with my modules in a number of cases. Mostly because those modules have been targeted at a wider audience, where some/many may not even be aware of Composer, let alone have access to it. I will likely continue to do that for the foreseeable future with more "mainstream" modules, even if I personally feel it's a bit old-school 😉
