SearchEngine File Indexer add-on


teppo

This module is an optional (and still somewhat experimental) add-on for SearchEngine. It adds support for indexing file contents and replaces the earlier SearchEngine PDF indexer module.

Features

SearchEngine by itself will only store the name, description, tags, and custom field values for file/image fields. This module, on the other hand, attempts to extract human-readable text from the file itself.

As for file types, at least in theory this module supports any filetype that can be reasonably converted to text. It has built-in support (mostly via third party libraries) for...

  • office documents (.doc, .docx, .rtf, .odf),
  • pdf documents (.pdf),
  • spreadsheets (.xls, .xlsx, .ods, .csv) and
  • plain text (.txt).

The module also ships with a FileIndexer base class and exposes the SearchEngineFileIndexer::addFileIndexer() method for introducing indexers for file types that are not yet supported.
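To illustrate the general idea of pluggable indexers, here is a rough, language-neutral sketch in Python. It is not the module's actual PHP API: the class names, the `extensions` attribute, and the `get_text()` and `add_file_indexer()` signatures are all assumptions made up for this example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class FileIndexer(ABC):
    """Hypothetical base class: turns one file into plain text."""

    # Lowercase extensions (with leading dot) this indexer claims to handle.
    extensions: tuple[str, ...] = ()

    @abstractmethod
    def get_text(self, path: Path) -> str:
        """Return the human-readable text extracted from the file."""


class PlainTextIndexer(FileIndexer):
    """Simplest possible indexer: reads .txt files as UTF-8."""

    extensions = (".txt",)

    def get_text(self, path: Path) -> str:
        return path.read_text(encoding="utf-8", errors="replace")


class IndexerRegistry:
    """Hypothetical registry mapping file extensions to indexers."""

    def __init__(self) -> None:
        self._by_extension: dict[str, FileIndexer] = {}

    def add_file_indexer(self, indexer: FileIndexer) -> None:
        # A later registration for the same extension replaces the earlier one.
        for ext in indexer.extensions:
            self._by_extension[ext.lower()] = indexer

    def get_text(self, path: Path) -> str | None:
        # Files with no registered indexer are skipped (None).
        indexer = self._by_extension.get(path.suffix.lower())
        return indexer.get_text(path) if indexer else None
```

In this sketch, supporting a new file type means subclassing `FileIndexer`, listing the extensions it handles, and registering an instance; unsupported files are simply skipped.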

Getting started

  1. install and configure SearchEngine (version 0.34.0 or later),
  2. install SearchEngine File Indexer,
  3. install third party dependencies — if you installed SearchEngineFileIndexer via Composer you should already have these available, otherwise you'll need to run "composer install" in the SearchEngineFileIndexer module directory,
  4. choose which file indexers you'd like to enable.

The rest should happen automagically behind the scenes.

Additional notes

The important thing to note here is that we're going to rely on third party libraries to handle parsing (most) files, and things can still go wrong, so please consider this a beta release. It did work in my early tests, but there's little guarantee that it will work in real life use cases. Just to be safe it is recommended to back up your site before installing and enabling this module.

Another thing to keep in mind is that indexing files can be resource intensive and take plenty of time. As such, this module provides some settings for limiting files by size etc. Regardless, this is something that likely needs further consideration in the future; some future version of this module, or an additional add-on module, may e.g. add support for indexing pages/files "lazily" in the background.
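As a rough illustration of what limiting files by size and type can look like, here is a minimal Python sketch. The function name and its parameters are hypothetical and do not correspond to the module's actual settings.

```python
from __future__ import annotations

from pathlib import Path


def should_index(path: Path, max_bytes: int, allowed_extensions: set[str]) -> bool:
    """Decide whether a file is worth handing to an indexer.

    Hypothetical helper: checks the (cheap) extension first, then the size,
    so that oversized files are never opened or parsed.
    """
    if path.suffix.lower() not in allowed_extensions:
        return False
    return path.stat().st_size <= max_bytes
```

Checking these limits before any parsing happens keeps the expensive step (text extraction) away from files that would be skipped anyway.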


  • 3 months later...

Hi teppo, although I believe this request may be more related to the core module (SearchEngine), I'm thinking that the question might belong here. Apologies if that's confusing! First off: in my limited testing, for files that contain properly stored textual data, this has worked as well as can be expected (based on the capability of the vendor classes). Thank you!

My thought here, however, is that if a search result is a hit because it matched the contents of a file, the expectation would likely be that the file itself gets listed as the search result (link). Currently, due to how SearchEngine (and ultimately ProcessWire) works, the resulting link points to the page the file resides on. Since there may be cases where a page holds many files, or where the files are somewhat hidden, would it be possible to link directly to the matched file instead of the owning page?

Random thoughts on this, not sure if it's on the right path or not...:

SearchEngineFileIndexer could have a configurable option to create separate (database-level) index_field entries (or custom-named field entries) specific to the files/images rather than the page (or both, or only the page, which is the current default), and, similarly to SearchEngine, an option for what to use as the result_summary_field. This would likely require altering the schema of the index_field from the SearchEngine module, for example by adding a column that holds the path of a file/image (if there is no path, we know the entry is a standard page rather than a file/image).

SearchEngine would then, if it detects the extra field in the database, optionally provide a direct link to the file instead of the owning page's URL. It might also (in certain places) pass along an is_file attribute, since I don't know whether AJAX-based auto-suggest searches would want file matches to show up in all situations, and that could be one way for the developer to filter them out.
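To make the idea a bit more concrete, here's a rough Python sketch of the logic I have in mind. None of these names exist in SearchEngine; `IndexEntry`, `file_url`, `result_link()` and the `include_files` flag are all made up for illustration, and the URLs are placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:
    """Hypothetical index row: one per page, or one per file on a page."""

    page_url: str
    text: str
    file_url: Optional[str] = None  # set only for file/image entries

    @property
    def is_file(self) -> bool:
        # No file URL means this is a standard page entry.
        return self.file_url is not None


def result_link(entry: IndexEntry) -> str:
    """Link to the file itself when the entry came from a file."""
    return entry.file_url if entry.file_url is not None else entry.page_url


def search(entries: list[IndexEntry], query: str, include_files: bool = True) -> list[str]:
    """Naive substring search returning one link per matching entry."""
    needle = query.lower()
    return [
        result_link(e)
        for e in entries
        if needle in e.text.lower() and (include_files or not e.is_file)
    ]
```

With something like this, an auto-suggest endpoint could call `search(..., include_files=False)` to leave file matches out, while the main results page would keep them and link straight to the file.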
