Posted (edited)

I was thinking about writing a module that enables PW to index and search the contents of uploaded files.

I know the easiest way to include search-inside-files capability is to use Google CSE. Developers who have a dedicated server may of course use one of the "big boys" of search like Lucene, ElasticSearch, Solr etc. Those need a Java runtime on the server (and older Solr setups a servlet container such as Apache Tomcat), which most people don't have at their disposal.

So, my question is, first of all: Would such a feature be used at all? I know you can already build some sort of meta-search from file descriptions, or by using the "one page per file" approach.

After some brainstorming, I came up with this:

Idea:
Make it possible to search file upload content (PDFs, Word, Excel)


Approach:
Build a module (d'oh)


Config settings:
select templates / file-fields (what to index) - list all inputfields of type "file"
“index now” button or “index each time a file is added” or cron? Performance?

 

Where / how to store indexes?

  • As a separate, new field inside each page?
  • On the file-system? In the module folder, each file has a related JSON file? (similar to language files)
  • A new, separate DB-table?


What we need:

  • filename / path / URL
  • filetype (to customize search results with file-icons)
  • page id
  • extracted content
  • timestamp of last index-build? MD5/SHA hash or some such?
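The list above could translate into a record roughly like the following, with the hash used to decide whether a file has to be re-indexed. This is only a Python sketch of the idea (a real module would be PHP); `IndexEntry`, `file_hash` and `needs_reindex` are hypothetical names, and SHA-256 is just one possible choice of hash.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexEntry:
    page_id: int        # page the file belongs to
    path: str           # filename / path / URL
    filetype: str       # e.g. "pdf", used for icons in search results
    content: str        # extracted text
    content_hash: str   # hash of the file bytes at index time
    indexed_at: int     # Unix timestamp of the last index build


def file_hash(data: bytes) -> str:
    # Hash of the raw file bytes; changes whenever the file changes.
    return hashlib.sha256(data).hexdigest()


def needs_reindex(entry: Optional[IndexEntry], data: bytes) -> bool:
    # Re-index if the file was never indexed or its bytes have changed.
    return entry is None or entry.content_hash != file_hash(data)
```

Comparing hashes rather than timestamps means a file that is re-uploaded unchanged does not trigger another (possibly slow) text extraction.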

How to handle user roles when doing actual search? Inherit from page? Inherit from file-field?
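For the "inherit from page" variant, one way to picture it: search the index first, then drop every hit whose page the current user is not allowed to view. The Python sketch below uses a plain dictionary of roles per page as a stand-in; in a real module the check would presumably go through PW's own access control. `visible_hits` and the data shapes are assumptions for illustration.

```python
def visible_hits(hits, user_roles, page_view_roles):
    """Keep only hits on pages the user may view.

    hits:            list of dicts with at least a "page_id" key
    user_roles:      set of role names the current user has
    page_view_roles: dict mapping page_id -> set of roles allowed to view
    """
    return [
        hit
        for hit in hits
        # Pages missing from the mapping are treated as not viewable.
        if user_roles & page_view_roles.get(hit["page_id"], set())
    ]
```

Filtering at query time means the index itself does not have to store any permissions, so a role change on a page takes effect without rebuilding the index.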

 

 

 

So, what do you think? Not worth it? Does something like this already exist (I searched, but found nothing)?

 

Edited by dragan
fixed copy-paste glitches
Posted

Since I started working with PW I never needed something like this, but in my pre-PW life I have worked on projects that needed it. Any project that includes some kind of documentation archive solution is a good candidate for a feature like this.

The most obvious example I can remember was the media regulation office here in Portugal. They have weekly meetings where they discuss media incidents, public complaints and things like that. Those reports are publicly available. At one point my (previous) company provided a solution for searching directly in those documents (PDFs). My role was creative only back then so I don't know the details, but basically what our platform was doing then was indexing the text as the file was uploaded.

I'm trying to land a project this year that could potentially make use of something like this.

Posted

What about a new field type? Derived from FieldtypeFile but more like FieldtypeMapMarker i.e. a combination type field with all the features of FieldtypeFile but with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

You would certainly need a module, and a way to extend it with different file type parsers (Excel, PDF etc etc).
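The "extend it with different file type parsers" part could work as a small registry that maps file extensions to extractor functions. A minimal Python sketch of that idea follows (the real thing would be PHP inside the module); `PARSERS`, `register` and `extract_text` are hypothetical names, and only a plain-text parser is included because PDF or Excel extraction would need an external library or tool.

```python
from pathlib import Path

# Maps a lowercase file extension (".txt") to a function bytes -> str.
PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register(".txt", ".md")
def parse_plain_text(data: bytes) -> str:
    # Undecodable bytes are replaced instead of raising an error.
    return data.decode("utf-8", errors="replace")


def extract_text(filename: str, data: bytes):
    """Return the extracted text, or None if no parser is registered."""
    parser = PARSERS.get(Path(filename).suffix.lower())
    return parser(data) if parser else None
```

Adding support for another file type is then just one more decorated function, and files without a registered parser are simply skipped by the indexer.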

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

Posted
41 minutes ago, DaveP said:

with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

If someone does decide to build this, take a look at the attached pdftotxt file in this post: https://processwire.com/talk/topic/3513-module-site-indexer/?do=findComment&comment=34470 - it's not fancy, but gets the job done. I haven't used it in a PW project yet, but I have it running on an older site for making the site search script return PDFs based on their content.

Posted

Nice idea! I have no need for it, but if I had, it would be great to have such a feature! And I can think of lots of use cases where someone could need this...

[offtopic]

5 hours ago, DaveP said:

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

I'm out :D

  • 1 month later...
Posted

Further to this, it turns out that http://modules.processwire.com/modules/indexer/ can (and does) index PDFs, and although it was created in 2013 and only claims compatibility up to PW 2.4, it still works. (tested on PW 3.0.84)

<edit class='bit-more-info'>

Not only does it still work, but it works brilliantly. No errors, no warnings, and this is with PHP 7.1! All on default settings, just check 'Use built-in PHP class?'.

</edit>

Posted

Update re previous post

Following on from a bit more testing, I have forked the Indexer module mentioned above and started a new thread...

 
