
Module idea: index + search file-contents


I was thinking about writing a module: Enabling PW to index and search contents of uploaded files.

I know the easiest way to include search-inside-files capability is to use Google CSE. Developers who have a dedicated server may of course use one of the "big boys" of search like Lucene, ElasticSearch, Solr etc. Those need a Java runtime on the server (and older Solr setups a servlet container such as Apache Tomcat), which most people don't have at their disposal.

So, my question is, first of all: Would such a feature be used at all? I know you can approximate it with some sort of meta-search over file descriptions, or by using the "one page per file" approach.

After some brainstorming, I came up with this:

Make it possible to search file upload content (PDFs, Word, Excel)

Build a module (d'oh)

Config settings:
  • Select templates / file fields (what to index): list all fields of type "file"
  • "Index now" button, or "index each time a file is added", or cron? Which performs best?


Where / how to store indexes?

  • As a separate, new field inside each page?
  • On the file-system? In the module folder, each file has a related JSON file? (similar to language files)
  • A new, separate DB-table?
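
To get a feel for the third option, here is a rough sketch of a separate index table. It uses Python's built-in sqlite3 purely as a stand-in for the MySQL database a PW site would actually use; the table name `file_index` and the helpers `create_index_table`, `store` and `search` are made up for illustration and are not part of any existing module.

```python
import sqlite3


def create_index_table(conn):
    """Create the hypothetical index table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS file_index ("
        " path TEXT PRIMARY KEY,"      # one row per indexed file
        " page_id INTEGER NOT NULL,"   # page the file is attached to
        " content TEXT NOT NULL)"      # extracted plain text
    )


def store(conn, path, page_id, content):
    """Insert a file's text, replacing any earlier row for the same path."""
    conn.execute(
        "INSERT OR REPLACE INTO file_index (path, page_id, content) "
        "VALUES (?, ?, ?)",
        (path, page_id, content),
    )


def search(conn, term):
    """Return (path, page_id) for every file whose text contains `term`."""
    # Escape LIKE wildcards so the search term is matched literally.
    escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT path, page_id FROM file_index "
        "WHERE content LIKE ? ESCAPE '\\' ORDER BY path",
        ("%" + escaped + "%",),
    )
    return rows.fetchall()
```

A plain substring match like this is only the simplest thing that could work; a real module would more likely lean on the database's full-text index.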

What we need:

  • filename / path / URL
  • filetype (to customize search results with file-icons)
  • page id
  • extracted content
  • timestamp of last index-build? MD5/SHA hash or some such?
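
To make the list above a bit more concrete, here is a minimal, language-neutral sketch in Python (not ProcessWire code) of what one index record could look like, and how a content hash could decide whether a file needs re-indexing. The names `IndexRecord`, `file_hash`, `needs_reindex` and `build_record` are invented for this example.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRecord:
    """One entry of the hypothetical file index (all names are illustrative)."""
    path: str          # filename / path / URL
    filetype: str      # e.g. "pdf", used to pick an icon in search results
    page_id: int       # id of the page the file belongs to
    content: str       # extracted plain text
    sha256: str        # hash of the file bytes at index time
    indexed_at: float  # Unix timestamp of the last index build


def file_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the raw file bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(record: Optional[IndexRecord], data: bytes) -> bool:
    """True if the file was never indexed or its bytes changed since then."""
    return record is None or record.sha256 != file_hash(data)


def build_record(path: str, page_id: int, data: bytes, text: str) -> IndexRecord:
    """Assemble a record; `text` is whatever the extractor returned."""
    filetype = path.rsplit(".", 1)[-1].lower() if "." in path else ""
    return IndexRecord(path, filetype, page_id, text, file_hash(data), time.time())
```

Comparing hashes rather than timestamps means a file that is re-uploaded unchanged would not be parsed again, which matters if extraction turns out to be slow.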

How to handle user roles when doing actual search? Inherit from page? Inherit from file-field?
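
The "inherit from page" variant could be sketched roughly like this, again in plain Python rather than PW code. It assumes search hits arrive as `(path, page_id)` pairs and that we can look up which roles may view each page; the function `visible_hits` and the `page_view_roles` mapping are hypothetical, and in a real module the page's own access check would take the place of the dictionary lookup.

```python
def visible_hits(hits, page_view_roles, user_roles):
    """Keep only hits whose owning page is viewable by one of the user's roles.

    hits            -- list of (path, page_id) tuples from the file search
    page_view_roles -- dict mapping page_id to the set of roles allowed to view it
    user_roles      -- iterable of role names held by the current user
    """
    user_roles = set(user_roles)
    allowed = []
    for path, page_id in hits:
        # A page missing from the mapping is treated as not viewable.
        if page_view_roles.get(page_id, set()) & user_roles:
            allowed.append((path, page_id))
    return allowed
```

Filtering after the search keeps the index itself free of access data, at the cost of sometimes returning fewer results than were requested.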




So, what do you think? Not worth it? Does something like this already exist (I searched, but found nothing)?


Edited by dragan
fixed copy-paste glitches

Since I started working with PW I have never needed something like this, but in my pre-PW life I worked on projects that did. Any project that includes some kind of documentation archive solution is a good candidate for a feature like this.

The most obvious example I can remember was the media regulation office here in Portugal. They have weekly meetings where they discuss media incidents, public complaints and things like that. Those reports are publicly available. At one point my (previous) company provided a solution for searching directly in those documents (PDFs). My role was creative only back then so I don't know the details, but basically what our platform was doing then was indexing the text as the file was uploaded.

I'm working to grab a project this year that could potentially make use of something like this.


What about a new field type? Derived from FieldtypeFile but more like FieldtypeMapMarker i.e. a combination type field with all the features of FieldtypeFile but with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

You would certainly need a module, and a way to extend it with different file type parsers (Excel, PDF, etc.).
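
One possible shape for the "extend it with parsers" part, sketched in Python just to show the idea: a registry that maps file extensions to extractor functions. Everything here (`PARSERS`, `register`, `extract`) is hypothetical. Only plain-text and CSV parsers are shown because they need nothing beyond the standard library; PDF or Excel parsers would be registered the same way but would depend on third-party tools.

```python
import csv
import io

# Maps a lowercase file extension to a function taking bytes and returning text.
PARSERS = {}


def register(*extensions):
    """Decorator that registers a parser function for one or more extensions."""
    def decorator(func):
        for ext in extensions:
            PARSERS[ext.lower()] = func
        return func
    return decorator


@register("txt", "md")
def parse_text(data: bytes) -> str:
    """Plain text: decode and return as-is."""
    return data.decode("utf-8", errors="replace")


@register("csv")
def parse_csv(data: bytes) -> str:
    """CSV: flatten all cells into one space-separated string."""
    reader = csv.reader(io.StringIO(data.decode("utf-8", errors="replace")))
    return " ".join(cell for row in reader for cell in row)


def extract(filename: str, data: bytes):
    """Return the extracted text, or None if no parser handles this file type."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    parser = PARSERS.get(ext)
    return parser(data) if parser else None
```

With a registry like this, supporting a new file type would only mean adding one decorated function, without touching the indexing code.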

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)


41 minutes ago, DaveP said:

with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

If someone does decide to build this, take a look at the attached pdftotxt file in this post: https://processwire.com/talk/topic/3513-module-site-indexer/?do=findComment&comment=34470 - it's not fancy, but gets the job done. I haven't used it in a PW project yet, but I have it running on an older site for making the site search script return PDFs based on their content.


Nice idea! I have no need for it, but if I had, it would be great to have such a feature! And I can think of lots of use cases where someone could need this...


5 hours ago, DaveP said:

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

I'm out :D


  • 1 month later...

Further to this, it turns out that http://modules.processwire.com/modules/indexer/ can (and does) index PDFs, and although it was created in 2013 and only claims compatibility up to PW 2.4, it still works. (tested on PW 3.0.84)

<edit class='bit-more-info'>

Not only does it still work, but it works brilliantly. No errors, no warnings, and this is with PHP 7.1! All on default settings, just check 'Use built-in PHP class?'.

