
Module idea: index + search file-contents


dragan

I was thinking about writing a module: Enabling PW to index and search contents of uploaded files.

I know the easiest way to include search-inside-files capability is to use Google CSE. Developers who have a dedicated server may of course use one of the "big boys" of search like Lucene, Elasticsearch or Solr. Those all run on Java, though (Elasticsearch needs a JVM, and older Solr versions were typically deployed in a servlet container such as Apache Tomcat), which most people don't have at their disposal.

So, my question is, first of all: Would such a feature be used at all? I know you can already build a kind of meta-search from file descriptions, or by using the "one page per file" approach.

After some brainstorming, I came up with this:

Idea:
Make it possible to search file upload content (PDFs, Word, Excel)


Approach:
Build a module (d'oh)


Config settings:

  • select templates / file fields (what to index): list all fields of type "file"
  • when to index: an "index now" button, "index each time a file is added", or cron? Performance?

 

Where / how to store indexes?

  • As a separate, new field inside each page?
  • On the file-system? In the module folder, each file has a related JSON file? (similar to language files)
  • A new, separate DB-table?
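To make the "separate DB table" option a bit more concrete, here is a minimal, language-agnostic sketch in Python using SQLite. It is not ProcessWire code: the table name `file_index`, its columns, and the helper functions are all hypothetical, and a real module would more likely use the site's MySQL database with a full-text index instead of `LIKE`.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_index (
    page_id   INTEGER NOT NULL,
    field     TEXT    NOT NULL,
    filename  TEXT    NOT NULL,
    content   TEXT    NOT NULL,
    PRIMARY KEY (page_id, field, filename)
)
"""

def open_index(path=":memory:"):
    """Open (or create) the index database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def store(conn, page_id, field, filename, content):
    """Insert or overwrite the extracted text for one uploaded file."""
    # INSERT OR REPLACE keeps one row per file, so re-indexing overwrites.
    conn.execute(
        "INSERT OR REPLACE INTO file_index VALUES (?, ?, ?, ?)",
        (page_id, field, filename, content),
    )
    conn.commit()

def search(conn, term):
    """Return (page_id, filename) pairs whose content contains the term."""
    # Plain substring match; % and _ in the term act as wildcards here.
    rows = conn.execute(
        "SELECT page_id, filename FROM file_index "
        "WHERE content LIKE ? ORDER BY page_id, filename",
        (f"%{term}%",),
    )
    return rows.fetchall()
```

One row per (page, field, file) means a re-upload simply replaces the old text, which keeps the table in sync without a separate cleanup step.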


What we need:

  • filename / path / URL
  • filetype (to customize search results with file-icons)
  • page id
  • extracted content
  • timestamp of last index-build? MD5/SHA hash or some such?
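The fields listed above could be grouped into one record per file, with the hash used to decide whether a file has to be re-indexed. The following is only a sketch under assumptions: `IndexEntry`, `file_hash` and `needs_reindex` are made-up names, and SHA-1 is just one of the hash options mentioned.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexEntry:
    """One indexed file; field names are hypothetical."""
    path: str          # filename / path / URL
    filetype: str      # e.g. "pdf", used to pick an icon in search results
    page_id: int       # page the file belongs to
    content: str       # extracted text
    indexed_at: float  # timestamp of the last index build
    sha1: str          # hash of the file bytes at index time

def file_hash(data: bytes) -> str:
    """Hex digest of the file contents."""
    return hashlib.sha1(data).hexdigest()

def needs_reindex(entry: Optional[IndexEntry], data: bytes) -> bool:
    """True if the file was never indexed or its bytes have changed."""
    return entry is None or entry.sha1 != file_hash(data)
```

Comparing hashes rather than timestamps avoids re-extracting text when a file is re-saved without changes, at the cost of reading the file once to hash it.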

How to handle user roles when doing the actual search? Inherit from the page? Inherit from the file field?
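As an illustration of the "inherit from page" option, search hits could be filtered by the view access of the page that owns each file. This is a sketch only: the hit dictionaries, the `page_view_roles` mapping and the role names are assumptions, not ProcessWire's access API.

```python
def filter_hits(hits, page_view_roles, user_roles):
    """Keep only hits whose owning page is viewable by one of the user's roles.

    hits            -- list of dicts with at least a "page_id" key
    page_view_roles -- dict mapping page_id to the set of roles allowed to view it
    user_roles      -- iterable of role names held by the current user
    """
    user_roles = set(user_roles)
    visible = []
    for hit in hits:
        allowed = page_view_roles.get(hit["page_id"])
        # Pages missing from the mapping are treated as not viewable.
        if allowed and allowed & user_roles:
            visible.append(hit)
    return visible
```

Denying access when a page is unknown errs on the safe side: a stale index entry for a deleted or restricted page never leaks into the results.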

 

 

 

So, what do you think? Not worth it? Does something like this already exist (I searched, but found nothing)?

 

Edited by dragan
fixed copy-paste glitches

Since I started working with PW I never needed something like this, but in my pre-PW life I have worked on projects that needed it. Any project that includes some kind of documentation archive solution is a good candidate for a feature like this.

The most obvious example I can remember was the media regulation office here in Portugal. They have weekly meetings where they discuss media incidents, public complaints and things like that. Those reports are publicly available. At one point my (previous) company provided a solution for searching directly in those documents (PDFs). My role was creative only back then so I don't know the details, but basically what our platform was doing then was indexing the text as the file was uploaded.

I'm working to grab a project this year that could potentially make use of something like this.

  • Like 1

What about a new field type? Derived from FieldtypeFile but more like FieldtypeMapMarker i.e. a combination type field with all the features of FieldtypeFile but with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

You would certainly need a module, and a way to extend it with different file type parsers (Excel, PDF etc etc).
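One way such pluggable parsers might look, sketched in Python rather than as a PW module: a registry keyed by file extension, so that Excel or PDF support can be added without touching the core. All names here (`PARSERS`, `parser`, `extract_text`) are hypothetical, and only a trivial plain-text parser is included.

```python
import os

# Registry mapping a lowercase file extension to a parser function.
PARSERS = {}

def parser(*extensions):
    """Decorator that registers a function as the parser for the given extensions."""
    def register(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return register

@parser("txt", "csv")
def parse_plain(data: bytes) -> str:
    """Plain-text files need no extraction, only decoding."""
    return data.decode("utf-8", errors="replace")

def extract_text(filename: str, data: bytes):
    """Return extracted text, or None if no parser is registered for the type."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    handler = PARSERS.get(ext)
    return handler(data) if handler else None
```

A PDF or Excel parser would then be just another function decorated with `@parser("pdf")` or `@parser("xlsx")`, wrapping whatever extraction tool is available on the server.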

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

  • Haha 1

41 minutes ago, DaveP said:

with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

If someone does decide to build this, take a look at the attached pdftotxt file in this post: https://processwire.com/talk/topic/3513-module-site-indexer/?do=findComment&comment=34470 - it's not fancy, but gets the job done. I haven't used it in a PW project yet, but I have it running on an older site for making the site search script return PDFs based on their content.

  • Like 1

Nice idea! I have no need for it, but if I had, it would be great to have such a feature! And I can think of lots of use cases where someone could need this...

[offtopic]

5 hours ago, DaveP said:

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

I'm out :D

  • Like 1

  • 1 month later...

Further to this, it turns out that http://modules.processwire.com/modules/indexer/ can (and does) index PDFs, and although it was created in 2013 and only claims compatibility up to PW 2.4, it still works. (tested on PW 3.0.84)

<edit class='bit-more-info'>

Not only does it still work, but it works brilliantly. No errors, no warnings, and this is with PHP 7.1! All on default settings, just check 'Use built-in PHP class?'.

</edit>

  • Like 6