Jonathan Dart

ElasticSearch for ProcessWire


ElasticSearch does a lot, but the part that is most interesting to me is that it does an amazing job of fulltext search. It's also crazy fast. It can be a bit scary at first so hopefully this module will make it more accessible.

I threw this module together pretty quickly; it's more of a proof of concept than anything else at this point. I tried it out on a site with 400 bilingual pages, and the search results are much improved over what you would get from LIKE or FULLTEXT queries in MySQL.

Github page: https://github.com/jdart/ElasticSearchProcessWire

I'd love to hear some feedback on how it works for you. 

It's very new so expect bugs, in particular the mechanism that turns pages into data to be indexed by ES might have some surprises.

Edited by Nico Knoll
Added the "module" tag.

I obviously need to read up on ElasticSearch some more, but this sounds pretty cool - thanks!


Awesome! I'm going to give this a try as I was just thinking about incorporating ElasticSearch into an app I'm putting together. I'll let you know how it works out.


I've updated the module to support typical pw style pagination:

$search_results = $modules->get('ElasticSearch')->search('foo bar', $results_per_page); 
echo "Total results: " . $search_results->getTotal();
echo $search_results->renderPagination();

Wow. "As usual" this post was made at the right time. We're currently building two projects that need some advanced search mechanisms. We thought about using Solr or Elasticsearch. I would probably have gone with Solr as we've used it in other projects before. This module will make our decision a lot easier :D


FWIW I've tried using Solr and it's much harder to use. ElasticSearch is like Solr + magic.

Edit: I originally wrote that ElasticSearch is a layer on top of Apache Solr, which is wrong. Both Solr and ElasticSearch are built on the Lucene search engine.


Hi Jonathan,

I just installed the module and tried to perform the initial indexing. But I got the following nesting-level-exceeded error (using win/php5.4.6)

( ! ) Fatal error: Maximum function nesting level of '400' reached, aborting! in ...\wire\core\Template.php on line 206
Call Stack
#    Time    Memory    Function    Location
1    0.0013    164624    {main}( )    ..\index.php:0
2    0.2451    12483568    ProcessPageView->execute( )    ..\index.php:195
3    0.2451    12483680    Wire->__call( )    ..\index.php:195
4    0.2451    12483680    Wire->runHooks( )    ..\Wire.php:317
5    0.2452    12485280    call_user_func_array ( )    ..\Wire.php:359
6    0.2452    12485376    ProcessPageView->___execute( )    ..\Wire.php:359
7    0.2555    12579304    Page->render( )    ..\ProcessPageView.module:167
8    0.2555    12579416    Wire->__call( )    ..\ProcessPageView.module:167
9    0.2555    12579416    Wire->runHooks( )    ..\Wire.php:317
10    0.3390    13117808    ElasticSearch->checkForRebuildSearchData( )    ..\Wire.php:381
11    0.4550    13572320    ElasticSearch->updatePageContentInElasticSearch( )    ..\ElasticSearch.module:127
12    0.4574    13586960    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:209
13    0.4684    13640000    ElasticSearch->getRepeaterTypeAsContent( )    ..\ElasticSearch.module:149
14    0.4684    13640880    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:198
15    0.4759    13942032    ElasticSearch->getPageTypeAsContent( )    ..\ElasticSearch.module:147
16    0.4759    13942048    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:190
17    0.4760    13943400    ElasticSearch->getRepeaterTypeAsContent( )    ..\ElasticSearch.module:149
18    0.4760    13944240    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:198
19    0.4761    13945496    ElasticSearch->getPageTypeAsContent( )    ..\ElasticSearch.module:147
20    0.4761    13945496    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:190
21    0.4762    13946848    ElasticSearch->getRepeaterTypeAsContent( )    ..\ElasticSearch.module:149
22    0.4762    13947688    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:198
23    0.4764    13948944    ElasticSearch->getPageTypeAsContent( )    ..\ElasticSearch.module:147
24    0.4764    13948944    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:190
25    0.4764    13950296    ElasticSearch->getRepeaterTypeAsContent( )    ..\ElasticSearch.module:149
26    0.4764    13951136    ElasticSearch->getAllContentForPage( )    ..\ElasticSearch.module:198
27    0.4766    13952400    ElasticSearch->getPageTypeAsContent( )    ..\ElasticSearch.module:147

....

I think there must be a problem with recursion through page and/or repeater fields. Did you experience something like this? Is there a patch for the module that prevents this kind of recursion?

regards,

Marco


Yes, I'm running xdebug (and already increased the nesting level to 400). But as the error message shows, there is an endless loop in function calls. So increasing the nesting level won't help.


You could be right but just for fun, have you tried setting it to let's say 1000 or even disabling xdebug and see if it runs?


A nesting level of 1000 didn't help. Deactivating the xdebug extension led to a memory exhaustion error (as expected).

This is an endless recursion problem (I think) and therefore cannot be solved by any kind of PHP configuration.

A possible solution could be to limit indexing to the actual text fields, and in particular to ignore fields that reference other pages, to prevent circular references.
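
Another way to break this kind of cycle, sketched here as an illustration rather than as what the module does, is to remember which pages have already been visited and stop when one comes up again. The `collectContent()` function and the array shape standing in for pages are hypothetical; the module itself walks real Page and Repeater values.

```php
<?php
/**
 * Collect indexable text for a page, following references but visiting each page once.
 *
 * @param array $pages   Map of id => ['title' => string, 'refs' => int[]] (illustrative shape)
 * @param int   $id      Page to start from
 * @param array $visited Ids already collected, passed by reference to guard against cycles
 * @return string[]      Titles gathered in visit order
 */
function collectContent(array $pages, int $id, array &$visited = []): array
{
    if (isset($visited[$id]) || !isset($pages[$id])) {
        return []; // already seen (a cycle) or unknown page: stop here
    }
    $visited[$id] = true;

    $content = [$pages[$id]['title']];
    foreach ($pages[$id]['refs'] as $refId) {
        foreach (collectContent($pages, $refId, $visited) as $text) {
            $content[] = $text;
        }
    }
    return $content;
}

// Two pages that reference each other, which would recurse forever without the guard.
$pages = [
    1 => ['title' => 'Home',  'refs' => [2]],
    2 => ['title' => 'About', 'refs' => [1]],
];
$result = collectContent($pages, 1);
```

With the guard in place, referenced pages still contribute their content once, which indexing only the title of a referenced page would not do.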


Hi Marco, I'm not sure what the issue might be. I'll check it out ASAP.


Hi Marco,

In ElasticSearch.module can you try changing the below function (around line 190):

protected function getPageTypeAsContent($value) {
    return $this->getAllContentForPage($value);
}

to:

protected function getPageTypeAsContent($value) {
    return $value->title;
}

Let me know if that gets rid of the nesting issue, and if search results are affected.

Thanks


Hi Jonathan,

I added your little patch, and it helped prevent the recursion problems.

The site content has been indexed.


Hi Jonathan, 

No doubt it's a good module, and just what I was looking for. But I am facing an issue with pagination when using the results from the ElasticSearch module: the pager always highlights the first page, although the records themselves are displayed correctly. For example, if I go to page 3 using the pagination, the search results for page 3 appear, but "Page 1" is still highlighted in the pager. This is how I have rendered the pager:

echo $search_results->renderPager();

Any help in this regard will be much appreciated.

Thanks.


Okay, so I figured it out. The "start" needs to be set on the PageArray, and that was missing in this module. I have added the following code

$pages->setStart($from);

right after the

$pages->setLimit($size);

at line 372 in the ElasticSearch.module file, and this fixed my issue.
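
For anyone wondering where `$from` comes from: it is the zero-based offset of the first result on the current page. The sketch below shows the usual arithmetic under the assumption that page numbers are 1-based; `pageOffset()` is a hypothetical helper, not a function from the module.

```php
<?php
/**
 * Convert a 1-based page number into the zero-based offset ("from") of a result window.
 */
function pageOffset(int $pageNum, int $size): int
{
    $pageNum = max(1, $pageNum); // treat 0 or negative page numbers as the first page
    return ($pageNum - 1) * $size;
}

$size = 25;
$from = pageOffset(3, $size); // page 3 with 25 results per page starts at offset 50
```

If the start is never set, it presumably stays at 0, which would explain why the pager keeps treating the current page as page 1.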


Out of curiosity, has anyone tested this with 2.5? I'm wondering whether there is an issue with the module or with my configuration; I don't seem to get any results when I index.


I'm using it in 2.5 and so is my co-worker.

What tripped me up when setting up the config: enter the IP and port, then click submit. Once the page reloads, click "index all pages".

We recently found some bugs with it, such as hidden pages being included in results, but it's working fine with some alterations.

Basic use: create a search page, e.g.

/search/?q=test

<?php if ($q = $sanitizer->selectorValue($input->get->q)) {
 $input->whitelist('q', $q);
 $matches = $modules->get("ElasticSearch")->search($q, 25); 
 foreach($matches as $key => $match) 
 { 
  if ($match->isHidden())
   $matches->remove($key); 		
 }
}
?>

<?php if ( ! $q): ?>
Type something.
<?php elseif ($matches->count()): ?>

 <?php foreach ($matches as $m): ?>
  <a href='<?php echo $m->url ?>'><?php echo $m->title ?></a>
 <?php endforeach ?>

<?php else: ?>
  no results found
<?php endif ?>


Hello Jonathan, is it possible to search text inside attachments?

thank you


Hello adrian,

I was looking for this, but I couldn't find it.

Thank you


I'm trying this module out and could use some troubleshooting tips.

Java and Elastic Search are installed. I'm forwarding port 9200 through to the virtual machine. I ran "sudo /etc/init.d/elasticsearch start" and if I try to access the site's domain using port 9200 I do get a response:

{
  "status": 200,
  "name": "Conquest",
  "cluster_name": "elasticsearch",
  "version": {
    "number": "1.5.1",
    "build_hash": "5e38401bc4e4388537a615569ac60925788e1cf4",
    "build_timestamp": "2015-04-09T13:41:35Z",
    "build_snapshot": false,
    "lucene_version": "4.10.4"
  },
  "tagline": "You Know, for Search"
}

I went with the default module settings for host and port and chose a template which has just 10 pages. When I click to index all pages I get this error:

Error: Maximum execution time of 30 seconds exceeded (line 617 of /web/elastic/wire/core/Page.php)

I'd think 30 seconds would be quite adequate for 10 pages so I'm wondering what I can do to diagnose the problem.

Tried it with the max execution time at 60sec and it timed out again.

Error: Maximum execution time of 60 seconds exceeded (line 622 of /web/elastic/wire/core/Page.php)

FYI: I'm using the dev branch (2.5.26) running Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-39-generic x86_64) in a virtual machine on my PC.

Thanks!
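
One diagnostic step that may help in a setup like this is to check the root response from PHP itself rather than from a browser, since a forwarded port can make the two behave differently. The sketch below only parses a response body; `elasticVersion()` is a hypothetical helper and the sample body is abbreviated from the response quoted above.

```php
<?php
/**
 * Extract the version number from the JSON body Elasticsearch returns at its root URL.
 * Returns null when the body is not valid JSON or lacks a version number.
 */
function elasticVersion(string $body): ?string
{
    $data = json_decode($body, true);
    if (!is_array($data) || !isset($data['version']['number'])) {
        return null;
    }
    return (string) $data['version']['number'];
}

// Abbreviated sample of the root response.
$body = '{"status": 200, "name": "Conquest", "version": {"number": "1.5.1"}}';
$version = elasticVersion($body);
```

In practice the body could come from something like `file_get_contents('http://localhost:9200/')`, assuming the default host and port and that `allow_url_fopen` is enabled. A null result would point at connectivity rather than at the indexing code.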


Elastic Search itself was okay. Here's what I found.

Timeout while indexing:

The module's code for indexing all pages does a find and I'd assumed it would make use of the template whitelist value from module configuration but it didn't. It finds lots of pages, then skips the ones which should not be indexed. I have thousands of simple pages (containers for images) which don't need to be found by this selector. Now I'm using the whitelist to build a more specific selector. May have to break this up into multiple finds when I have more content.

In checkForRebuildSearchData()

		$arr = $this->getAllowedTemplates();
		$str = (count($arr)) ? ' template='.implode('|', $arr).',' : '';
		$pages = $this->pages->find("id!=2, id!=7, has_parent!=2, has_parent!=7, template!=admin,$str include=all");
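
If the find does need to be split up later, one option is to generate a selector per batch using ProcessWire's `start` and `limit` selector keywords. This is only a sketch of the string building, with a hypothetical `buildBatchSelectors()` helper and made-up template names; the id and parent exclusions from the selector above are left out for brevity.

```php
<?php
/**
 * Build selector strings that fetch pages of the given templates in fixed-size batches.
 *
 * @param string[] $templates Whitelisted template names (empty means no template filter)
 * @param int      $total     Number of pages expected to match
 * @param int      $batchSize Pages per find() call
 * @return string[]           One selector per batch
 */
function buildBatchSelectors(array $templates, int $total, int $batchSize): array
{
    $batchSize = max(1, $batchSize); // avoid an endless loop on a zero or negative size
    $base = 'include=all';
    if (count($templates) > 0) {
        $base = 'template=' . implode('|', $templates) . ', ' . $base;
    }

    $selectors = [];
    for ($start = 0; $start < $total; $start += $batchSize) {
        $selectors[] = "$base, start=$start, limit=$batchSize";
    }
    return $selectors;
}

// Hypothetical template names; 250 matching pages fetched 100 at a time.
$selectors = buildBatchSelectors(['article', 'news'], 250, 100);
```

Each string could then be passed to `$this->pages->find()` in turn, so that only one batch of pages is loaded at a time.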

The other thing that became obvious pretty quickly is that the Textareas (with an s) fieldtype was not handled. Adding a function and a line to use it in getAllContentForPage() took care of that.

    protected function getTextareasTypeAsContent($value) {
        // Copy each named sub-field into a plain array for indexing.
        $values = array();
        foreach ($value as $name => $text) {
            $values[$name] = $text;
        }
        return $values;
    }

...

			elseif ($type instanceof FieldtypeTextareas)
				$value = $this->getTextareasTypeAsContent($value);

I've confirmed that it is picking up changes when I edit pages. Too early for opinions on effectiveness of Elastic Search itself.
