dragan

Module idea: index + search file-contents

I was thinking about writing a module that enables PW to index and search the contents of uploaded files.

I know the easiest way to include search-inside-files capability is to use Google CSE. Developers who have a dedicated server may of course use one of the "big boys" of search like Lucene, ElasticSearch, Solr etc. Those all need a Java runtime on the server (and Solr has traditionally been deployed in a servlet container such as Apache Tomcat), which most people don't have at their disposal.

So, my question is, first of all: Would such a feature be used at all? I know you can already create some sort of meta-search with file descriptions, or by using the "one page per file" approach.

After some brainstorming, I came up with this:

Idea:
Make it possible to search file upload content (PDFs, Word, Excel)


Approach:
Build a module (d'oh)


Config settings:
select templates / file-fields (what to index) - list all inputfields of type "file"
“index now” button or “index each time a file is added” or cron? Performance?

 

Where / how to store indexes?

  • As a separate, new field inside each page?
  • On the file-system? In the module folder, each file has a related JSON file? (similar to language files)
  • A new, separate DB-table?


What we need:

  • filename / path / URL
  • filetype (to customize search results with file-icons)
  • page id
  • extracted content
  • timestamp of last index-build? MD5/SHA hash or some such?
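A rough sketch of the "separate DB table" option, holding the fields listed above. This is only an illustration under assumptions: it is written in Python with SQLite to keep it self-contained, and the table name `file_index`, the column names and the helper functions are all made up for this example. None of it is an existing PW API. The stored hash is what would let an "index now" button or a cron job skip files that have not changed.

```python
import hashlib
import sqlite3
import time

# Hypothetical table layout; names are illustrative, not part of PW.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_index (
    path       TEXT PRIMARY KEY,  -- filename / path / URL
    filetype   TEXT NOT NULL,     -- e.g. "pdf", used to pick a result icon
    page_id    INTEGER NOT NULL,  -- page the file is attached to
    content    TEXT NOT NULL,     -- extracted plain text
    sha1       TEXT NOT NULL,     -- hash of the file bytes at index time
    indexed_at INTEGER NOT NULL   -- Unix timestamp of the last index build
)
"""


def file_hash(data: bytes) -> str:
    """Hash the raw file bytes so changed files can be detected."""
    return hashlib.sha1(data).hexdigest()


def needs_reindex(db, path, data):
    """True if the file is not indexed yet or its bytes have changed."""
    row = db.execute(
        "SELECT sha1 FROM file_index WHERE path = ?", (path,)
    ).fetchone()
    return row is None or row[0] != file_hash(data)


def store(db, path, filetype, page_id, content, data):
    """Insert or overwrite the index row for one file."""
    db.execute(
        "INSERT OR REPLACE INTO file_index VALUES (?, ?, ?, ?, ?, ?)",
        (path, filetype, page_id, content, file_hash(data), int(time.time())),
    )


def search(db, term):
    """Naive substring search over the extracted text."""
    return db.execute(
        "SELECT path, filetype, page_id FROM file_index WHERE content LIKE ?",
        (f"%{term}%",),
    ).fetchall()
```

A plain `LIKE` query is the simplest thing that works; a real module would more likely lean on a fulltext index in the database.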

How to handle user roles when doing actual search? Inherit from page? Inherit from file-field?
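For the "inherit from page" option, one possible shape is to run the search first and then drop every hit whose page the current user may not view. The sketch below assumes exactly that; `Hit` and `can_view_page` are hypothetical names, with `can_view_page` standing in for whatever permission check the CMS offers (in PW, presumably something along the lines of `$page->viewable()`).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search result: a file and the page it is attached to."""
    path: str
    page_id: int


def visible_hits(hits, can_view_page):
    """Keep only hits whose parent page the current user may view.

    can_view_page is a hypothetical callback: page id -> bool.
    Each page is checked once, even if several of its files matched.
    """
    allowed = {}
    result = []
    for hit in hits:
        if hit.page_id not in allowed:
            allowed[hit.page_id] = can_view_page(hit.page_id)
        if allowed[hit.page_id]:
            result.append(hit)
    return result
```

Inheriting from the file-field instead would only change what the callback looks at; the filtering step stays the same.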

 

 

 

So, what do you think? Not worth it? Does something like this already exist (I searched, but found nothing)?

 

Edited by dragan
fixed copy-paste glitches


Since I started working with PW I have never needed something like this, but in my pre-PW life I worked on projects that did. Any project that includes some kind of documentation archive solution is a good candidate for a feature like this.

The most obvious example I can remember was the media regulation office here in Portugal. They have weekly meetings where they discuss media incidents, public complaints and things like that. Those reports are publicly available. At one point my (previous) company provided a solution for searching directly in those documents (PDFs). My role was creative only back then so I don't know the details, but basically what our platform was doing then was indexing the text as the file was uploaded.

I'm trying to land a project this year that could potentially make use of something like this.


What about a new field type? Derived from FieldtypeFile but more like FieldtypeMapMarker i.e. a combination type field with all the features of FieldtypeFile but with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

You would certainly need a module, and a way to extend it with different file type parsers (Excel, PDF etc.).
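The "extend it with different file type parsers" part could look roughly like a small registry keyed by file extension. This is a sketch only, in Python for brevity; `EXTRACTORS`, `extractor` and `extract` are names invented here, and the only parser included is a trivial plain-text one. Real PDF or Excel parsers would be registered the same way.

```python
import os

# Maps a lowercase file extension to a function: bytes -> extracted text.
EXTRACTORS = {}


def extractor(*extensions):
    """Decorator that registers a parser for one or more extensions."""
    def register(func):
        for ext in extensions:
            EXTRACTORS[ext.lower()] = func
        return func
    return register


@extractor("txt", "md")
def extract_plain_text(data: bytes) -> str:
    """Plain-text files need no parsing, just decoding."""
    return data.decode("utf-8", errors="replace")


def extract(filename, data):
    """Return the extracted text, or None if no parser is registered."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    func = EXTRACTORS.get(ext)
    return func(data) if func else None
```

Files with an unknown extension simply yield `None`, so the indexer can skip them instead of failing.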

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

41 minutes ago, DaveP said:

with an added 'index' or 'contents' subfield (is that a thing?). Let's call it FieldtypeIndexedFile.

If someone does decide to build this, take a look at the attached pdftotxt file in this post: https://processwire.com/talk/topic/3513-module-site-indexer/?do=findComment&comment=34470 - it's not fancy, but gets the job done. I haven't used it in a PW project yet, but I have it running on an older site for making the site search script return PDFs based on their content.


Nice idea! I have no need for it, but if I had, it would be great to have such a feature! And I can think of lots of use cases where someone could need this...

[offtopic]

5 hours ago, DaveP said:

Should be doable, but way beyond my abilities to even attempt. (Just watch someone have one working in about an hour from now.)

I'm out :D


Further to this, it turns out that http://modules.processwire.com/modules/indexer/ can (and does) index PDFs, and although it was created in 2013 and only claims compatibility up to PW 2.4, it still works. (tested on PW 3.0.84)

<edit class='bit-more-info'>

Not only does it still work, but it works brilliantly. No errors, no warnings, and this is with PHP 7.1! All on default settings, just check 'Use built-in PHP class?'.

</edit>


Update re previous post

Following on from a bit more testing, I have forked the Indexer module mentioned above and started a new thread...

 


