SearchEngine PDF Indexer add-on


teppo

This module has been superseded by SearchEngineFileIndexer and there will be no further development for it.

---

This module is an optional — and experimental — add-on for SearchEngine. It adds support for indexing PDF file contents.

While SearchEngine is technically able to index file fields and contained Pagefile(s), it will natively only store the name and description of each file (and hopefully soon custom field values as well). This module hooks into Pagefile indexing and, if said Pagefile looks like a valid PDF document, attempts to extract human-readable text from the file itself.

Getting started is straightforward: install and configure SearchEngine, install SearchEngine PDF Indexer, and choose which PDF parser library you'd like to use. The rest should happen automagically behind the scenes.

---

Now, as you may or may not know, PDF files are notoriously difficult to process programmatically. For this reason a) we're going to rely on third-party libraries to handle parsing them, and b) things can still go wrong, so please consider this module an early beta release. It did work in my early tests, but there's little guarantee that it will work in real-life use cases, and as such I'd recommend backing up your site before installing/enabling this module.

Also: while this module can be installed via the admin or by cloning/downloading the module from the GitHub repository, please note that you need to run composer install in the module's directory, or preferably install the whole module via Composer. This is mainly because I really don't like bundling dependencies with the module, especially when there's a bunch of them.
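For anyone who cloned or downloaded the module by hand, the Composer step could look roughly like the sketch below. The directory name site/modules/SearchEnginePdfIndexer is an assumption (use wherever the module actually lives in your site), and the checks around the command are just there so the snippet fails politely instead of running in the wrong place.

```shell
# Assumed module location inside a ProcessWire site; adjust to your setup.
MODULE_DIR="${MODULE_DIR:-site/modules/SearchEnginePdfIndexer}"

if ! command -v composer >/dev/null 2>&1; then
    # Composer itself is missing, so nothing can be installed.
    echo "Composer not found on PATH; install it first." >&2
elif [ ! -f "$MODULE_DIR/composer.json" ]; then
    # Wrong path, or the module has not been downloaded yet.
    echo "No composer.json in $MODULE_DIR; check the module path." >&2
else
    # Fetch the dependencies declared in the module's composer.json,
    # skipping development-only packages.
    (cd "$MODULE_DIR" && composer install --no-dev)
fi
```

If you install the whole module via Composer instead, this step should not be needed, since Composer resolves the module's dependencies along with the module itself.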

(... although if you dislike Composer or for whatever reason can't use it, feel free to load either smalot/pdfparser or spatie/pdf-to-text manually. Just make sure that they're available by the time the module's class file is constructed.)

---

If you get a chance to use this module, please let me know how it went.


Hey @teppo thx for sharing!

  On 7/16/2022 at 12:21 AM, teppo said:

This is mainly because I really don't like bundling dependencies with the module, especially when there's a bunch of them.

A bit off-topic, sorry, but do you have any good reasons for that? I recently took the opposite route and added some dependencies to one of my modules because I did not like the extra step of Composer for making the module work. The dependencies were small in my opinion, only around 200 or 300 kB, but others might judge differently. I'd be happy to hear your thoughts about that topic!


  On 7/16/2022 at 1:43 PM, bernhard said:

A bit off-topic, sorry, but do you have any good reasons for that? I recently took the opposite route and added some dependencies to one of my modules because I did not like the extra step of Composer for making the module work. The dependencies were small in my opinion, only around 200 or 300 kB, but others might judge differently. I'd be happy to hear your thoughts about that topic!

No worries, this is always an interesting topic to discuss, as off-topic as it may be. My answer will be a bit lengthy, so I'll wrap it in a spoiler tag (feels overkill to split this into a new topic).

  Reveal hidden contents
