teppo

SearchEngine

@teppo I don't think this is to do with my setup. I have tried three different character sets and encodings, all throwing errors. In particular, if I dump the result of processIndex() just before returning it, the "à" character is getting encoded as \xC3 when it should be \xC3\xA0. You can check this here: https://mothereff.in/utf-8.

I don't understand enough about how you are prepping the data before saving the field, but I think this is an issue with multibyte substrings.

Sorry for the spam, but I found a solution. I am not an expert on prepping strings for the database, but replacing the line here with the one below, to make it Unicode-aware, fixes things for me:

$processed_index = preg_replace('/\s+/u', ' ', $processed_index);
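
For illustration, here is a minimal sketch of what that line does to a string containing "à". The sample text is an assumption made up for this example; only the preg_replace() call itself comes from the post above.

```php
<?php
// Sample text (an assumption for this sketch): "à" written as its two UTF-8 bytes.
$processed_index = "Testing  \xC3\xA0 \n\t 123";

// With the "u" modifier, pattern and subject are treated as UTF-8, so the
// bytes 0xC3 0xA0 are handled as one character and cannot be split.
$processed_index = preg_replace('/\s+/u', ' ', $processed_index);

echo $processed_index, PHP_EOL;      // Testing à 123
echo bin2hex("\xC3\xA0"), PHP_EOL;   // c3a0
```

The hex dump is the same check as the one on the linked UTF-8 page: a correctly stored "à" is the byte pair c3 a0, not a lone c3.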

 

Thanks @Mikie! I'll take a closer look at this ASAP 🙂

1 hour ago, teppo said:

I'll take a closer look at this ASAP 🙂

Coolio, no rush. I think this might also have been to do with having HTML entity encoders set on those CKEditor fields. I have no idea why I did that; maybe I did it while testing something else.

21 hours ago, Mikie said:

@teppo I don't think this is to do with my setup. I have tried three different character sets and encodings, all throwing errors. In particular, if I dump the result of processIndex() just before returning it, the "à" character is getting encoded as \xC3 when it should be \xC3\xA0. You can check this here: https://mothereff.in/utf-8.

I don't understand enough about how you are prepping the data before saving the field, but I think this is an issue with multibyte substrings.

Heya!

I've looked a bit into this, but to be honest I'd like to gain a better understanding of the situation before applying the fix. Any chance you could check the charset and collation of the field_search_index table (assuming search_index is your search index field)? The output of "SHOW FULL COLUMNS FROM field_search_index" should be enough.

The "u" modifier for preg_replace() does some things I'm slightly worried about, i.e. it's documented as "not compatible with Perl", it changes how matches are treated, and it may also result in warnings if the subject string is invalid UTF-8, so at the very least it may require a bit of extra validation as well to account for that. Before going there I'd like to figure out how to reproduce this issue first. I've tried all sorts of special characters with no luck; so far everything has worked just fine here 🙂
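
One possible shape for that extra validation, as a rough sketch only: when the subject is not valid UTF-8, preg_replace() with the "u" modifier returns null, so the result can be checked and a byte-safe pattern used instead. The function name collapse_whitespace() is hypothetical and not part of the module.

```php
<?php
/**
 * Collapse runs of whitespace into single spaces.
 * Hypothetical helper for this sketch, not part of SearchEngine.
 */
function collapse_whitespace(string $text): string
{
    // Unicode-aware attempt; returns null if $text is not valid UTF-8.
    $result = preg_replace('/\s+/u', ' ', $text);

    if ($result === null) {
        // Fallback: explicit ASCII whitespace bytes only, so no byte
        // above 0x7F can ever be treated as whitespace.
        $result = preg_replace('/[\x09-\x0D\x20]+/', ' ', $text);
    }

    return $result ?? $text;
}

var_dump(collapse_whitespace("a \n\t b") === "a b");
var_dump(collapse_whitespace("\xC3  x") === "\xC3 x"); // invalid UTF-8 input
```

The fallback spells out the whitespace bytes instead of using \s, which keeps it independent of whatever locale is active.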

Also, when you say that the "à" character "is getting encoded as \xc3 when it should be \xC3\xA0", what do you mean exactly? I mean... do you literally see \xc3 somewhere, or do I have to grab the value and pass it through some sort of inspection process to see that it's wrong? If I dump the result of processIndex(), I see the "à" character on the screen, and that's also what's being stored in the database.

Sorry, I'm easily confused when it comes to things like character sets etc. 😅

Edit: forgot to mention that based on Stack Overflow this definitely looks like a character set issue, i.e. the typical case where this error occurs is when you're trying to store UTF-8 data into a latin1 table. Assuming that the CKEditor field in question is some form of UTF-8, the index field data column should definitely also be UTF-8, and if it's not, that sounds really weird.

On 2/6/2020 at 12:11 AM, Mikie said:

Cheers! I am storing magazine style story credits (role, name, website url etc) in the Table. I feel that since Table only accepts text based fields this is an ok candidate for indexing. Can try to hack away at your module myself for now, no rush.

Table field is now one of the supported fieldtypes in SearchEngine 0.17.0. The indexing part makes use of TableRows::render(); I may have to revisit this at some point, but this approach seemed to work quite well in my initial tests, and this way I don't have to identify each possible value but can rather let the fieldtype do all the heavy lifting 🙂

8 hours ago, teppo said:

Table field is now one of the supported fieldtypes in SearchEngine 0.17.0. The indexing part makes use of TableRows::render(); I may have to revisit this at some point, but this approach seemed to work quite well in my initial tests, and this way I don't have to identify each possible value but can rather let the fieldtype do all the heavy lifting 🙂

I've edited this reply, since I double-checked and it is happening both with and without entity encoders active on CKEditor fields when trying to save the search index. 

See screenshots below, with the text Testing “testing” à 123 in a CKEditor field. The strange quotes get converted to UTF-8 from HTML encoding, but the UTF-8 "à" symbol gets clipped in half. Can confirm PW / DB / db table / db column are all using utf8mb4 + unicode_ci (learnt a lot about this stuff over the past few days!).

[Four screenshots attached to the original post]

Here is a dump after adding that Unicode-aware regex flag:

Sorry about the line numbers in the Tracy output, I had hacked your module to add tables 😏

[Screenshot attached to the original post]

Thanks, I'll try to dedicate a bit more time to this later today. I'm still confused as to why it's happening (can only assume that there's some difference in the environment), but perhaps the "u" flag indeed is the correct fix. Will have to check that it doesn't cause additional issues in cases where the module is now working as expected... 🙂

Yeah, I wouldn't worry about it too much. If you can't replicate it on your end, it's definitely a problem with the environment. I will try to figure it out from my end.

And it only happens on Mac as well! Wow, what an edge case.

12 minutes ago, Mikie said:

Here is the exact issue. setlocale() seems to be the problem when combined with preg_replace() on whitespace...

https://github.com/silverstripe/silverstripe-framework/issues/7132

 

7 minutes ago, Mikie said:

And it only happens on Mac as well! Wow, what an edge case.

Awesome — thanks for digging these out! 🙂

6 minutes ago, teppo said:

Awesome — thanks for digging these out! 🙂

No worries! Can confirm I had setlocale(LC_ALL, 'en_US.UTF-8'); in my site config. I only do this when PW tells me to, and haven't taken the time to understand why. Turning that off also fixed the issue.

There's enough discussion within that Silverstripe GitHub issue about the alternatives. It's a very, very edge case, so I will leave it up to you!
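
A quick way to check whether a given environment is affected might look like the sketch below. It assumes the mechanism discussed in this thread: under some locale settings, \s without the "u" modifier also matches the single byte 0xA0, which is the second byte of a UTF-8 "à". The function name is hypothetical, and the outcome depends on the platform and on whether the en_US.UTF-8 locale is installed.

```php
<?php
/**
 * Hypothetical diagnostic: does "\s" without the "u" modifier match the
 * single byte 0xA0 in the currently active locale?
 */
function whitespace_class_matches_byte_a0(): bool
{
    return preg_match('/\s/', "\xA0") === 1;
}

// Passing '0' only queries the current locale without changing it.
$previous = setlocale(LC_CTYPE, '0');

// Returns false (and changes nothing) if the locale is not installed.
setlocale(LC_CTYPE, 'en_US.UTF-8');

// true would mean a "\s+" replace can cut "à" in half here.
var_dump(whitespace_class_matches_byte_a0());

// Restore the previous locale.
if ($previous !== false) {
    setlocale(LC_CTYPE, $previous);
}
```

If this prints bool(true), either the "u" modifier or an explicit ASCII whitespace class avoids the problem without touching the locale setting.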

The aforementioned issue should be fixed now. As was already mentioned above, this could only be replicated under specific circumstances on macOS; nevertheless it seems that defining the "u" flag for preg_replace() is a relatively safe thing to do, so I've gone ahead and done that. If it ends up causing trouble, I may have to reconsider that, but at least for now it seems to be all good 🙂

Thanks @Mikie for tracking this down!

@teppo Fantastic addition! Searching repeater and page reference fields is working great.

Would it be possible to add support for FieldsetPage fields?

 

Hi @teppo,

Thanks a lot for your module!

Whenever I check "Index pages now?" in the module's backend config and save to build/rebuild the index field, PW throws a lengthy error (see attached PNG). I've selected a couple of text/textarea fields to index and included the index field in my templates. Calling

$modules->get('SearchEngine')->indexPages();

from the search template seems to work fine though. Am I making some newbie mistake here or is that an actual bug?

Screenshot_2020-03-30.png

Hello @teppo and all,

I currently run the same processwire site on multiple servers with a shared database and shared asset resources.  This has been working fine for years, but we've been using elasticsearch, which has required feeding index updates from our multiple servers to a single elasticsearch index.  It looks to me like this module will eliminate the need for that additional complexity, and here's where I would like someone to correct me if I'm wrong or point out any flaws in my understanding.  I see that the page indexes are updated with a hook upon page save, and the index info will be stored in the database as a page field.  It seems to me that this should work fine in a multi-server environment sharing a single database, as the page save events will only happen once from the server the page is saved on, and the index will be updated for all servers sharing that database.

Short of something extreme like simultaneously running a complete re-index from multiple servers (which probably would still work out ok...), does anyone see any problem with this approach, or see considerations in this scenario that I may be missing?  Your input is appreciated.

Best Regards,
David

15 hours ago, CalleRosa40 said:

Whenever I check "Index pages now?" in the module's backend config and save to build/rebuild the index field, PW throws a lengthy error (see attached PNG). I've selected a couple of text/textarea fields to index and included the index field in my templates.

It looks like you're using Hanna Code with one or more of your indexed fields. Is that correct?

Here something is trying to resize a Pageimage object while it's actually a Pageimages object, which usually means that output formatting is off. If so, you could fix this in the Hanna Code snippet itself (by checking for Pageimages and getting the first Pageimage from it). I'll see if there's something I can do to make this work better, but that's the quick fix anyway.

(Assuming I understood the stack trace correctly...)

On 3/30/2020 at 3:21 PM, Confluent Design said:

Hello @teppo and all,

I currently run the same processwire site on multiple servers with a shared database and shared asset resources.  This has been working fine for years, but we've been using elasticsearch, which has required feeding index updates from our multiple servers to a single elasticsearch index.  It looks to me like this module will eliminate the need for that additional complexity, and here's where I would like someone to correct me if I'm wrong or point out any flaws in my understanding.  I see that the page indexes are updated with a hook upon page save, and the index info will be stored in the database as a page field.  It seems to me that this should work fine in a multi-server environment sharing a single database, as the page save events will only happen once from the server the page is saved on, and the index will be updated for all servers sharing that database.

Short of something extreme like simultaneously running a complete re-index from multiple servers (which probably would still work out ok...), does anyone see any problem with this approach, or see considerations in this scenario that I may be missing?  Your input is appreciated.

Best Regards,
David

Any thoughts on this?  @teppo?  If I am doing something wrong in terms of the form or placement of my inquiry, please let me know that as well.  I want to do things the right way.  Thanks for your time, and apologies for any inconvenience.

Best Regards,
David

Hey @Confluent Design,

Sorry for the delay. Your message slipped my mind, thanks for pinging me 🙂

Quote

[...] but we've been using elasticsearch, which has required feeding index updates from our multiple servers to a single elasticsearch index.  It looks to me like this module will eliminate the need for that additional complexity [...]

You're correct in that SearchEngine stores its index in a ProcessWire field, so yes — since your database is already shared, there's nothing else to do in that regard; unless I've very much misunderstood something here, it should work right out of the box.

Quote

Short of something extreme like simultaneously running a complete re-index from multiple servers (which probably would still work out ok...), does anyone see any problem with this approach, or see considerations in this scenario that I may be missing?  Your input is appreciated.

From a basic technical point of view, no, I don't see any issues here. It's worth noting, though, that elasticsearch is quite a bit more complex and feature-rich than this little module here. Depending on your needs that might or might not be an issue.

Basically with SearchEngine you get a searchable blob of all page content that can be converted to text without toying around with additional APIs or libraries (so no file data at the moment), and the search itself is done using a simple selector string — no advanced weighting, stemming, etc. I do have a few "advanced" features on my todo list, but at the moment there's no timeline for any of that 🙂

Thank you @teppo!  This was just the feedback I was looking for.  On both counts, that I was on the right track, and that I might be losing some fancy functionality by going away from elasticsearch.  Appreciate all of that.  I'm going to give your module a try as a replacement!
