teppo

SearchEngine


@teppo I don't think this is to do with my setup. I have tried three different character sets and encodings, all throwing errors. In particular, if I dump the result of processIndex() just before returning it, the "à" character is coming out as \xC3 when it should be \xC3\xA0. You can check this here: https://mothereff.in/utf-8.

I don't understand enough about how you are prepping the data before saving the field, but I think this is an issue with multibyte substrings.


Sorry for the spam, but I found a solution. I am not an expert on prepping strings for the database, but replacing the line here with the one below, which makes the pattern Unicode-aware, fixes things for me:

$processed_index = preg_replace('/\s+/u', ' ', $processed_index);

 


 

1 hour ago, teppo said:

I'll take a closer look at this ASAP 🙂

Coolio, no rush. I think this might also have had to do with having HTML entity encoders set on those CKEditor fields. I have no idea why I did that; maybe I did it while testing something else.

21 hours ago, Mikie said:

@teppo I don't think this is to do with my setup. I have tried three different character sets and encodings, all throwing errors. In particular, if I dump the result of processIndex() just before returning it, the "à" character is coming out as \xC3 when it should be \xC3\xA0. You can check this here: https://mothereff.in/utf-8.

I don't understand enough about how you are prepping the data before saving the field, but I think this is an issue with multibyte substrings.

Heya!

I've looked a bit into this, but to be honest I'd like to gain a better understanding of the situation before applying the fix. Any chance you could check the charset and collation of the field_search_index table (assuming search_index is your search index field)? The output of "SHOW FULL COLUMNS FROM field_search_index" should be enough.

The "u" modifier for preg_replace() does some things I'm slightly worried about, i.e. it's documented as "not compatible with Perl", it changes how matches are treated, and it may also result in warnings if the subject string is invalid UTF-8, so at the very least it may require a bit of extra validation to account for that. Before going there I'd like to figure out how to reproduce this issue first. I've tried all sorts of special characters with no luck; so far everything has worked just fine here 🙂
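For reference, a minimal sketch of what that extra validation could look like. The helper name collapseWhitespace() is made up for illustration and is not part of SearchEngine. It relies on the assumption that preg_replace() returns null when the "u" modifier meets a subject that is not valid UTF-8, and in that case falls back to a byte-mode pattern that only touches ASCII whitespace:

```php
<?php
/**
 * Collapse runs of whitespace into single spaces.
 * Hypothetical helper for illustration; not the module's actual code.
 */
function collapseWhitespace(string $text): string
{
    // Unicode-aware attempt: yields null if $text is not valid UTF-8.
    $result = preg_replace('/\s+/u', ' ', $text);
    if ($result !== null) {
        return $result;
    }

    // Fallback: explicit ASCII whitespace bytes only, so the outcome does
    // not depend on locale tables and non-ASCII bytes are left untouched.
    return preg_replace('/[ \t\n\r\f\x0B]+/', ' ', $text) ?? $text;
}
```

With this shape, a valid string such as "Testing  à \n 123" would be collapsed through the "u" branch, while a string that already contains a stray \xC3 byte would go through the fallback instead of being dropped.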

Also, when you say that the "à" character is "getting encoded as \xc3 when it should be \xC3\xA0", what do you mean exactly? I mean... do you literally see \xc3 somewhere, or do I have to grab the value and pass it through some sort of inspection process to see that it's wrong? If I dump the result of processIndex(), I see the "à" character on the screen, and that's also what's being stored in the database.
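One way to do that kind of inspection is to look at the raw bytes rather than the rendered character. The following is only a sketch; describeBytes() is a hypothetical helper name, and the validity check uses the common idiom of matching an empty pattern with the "u" modifier:

```php
<?php
/**
 * Return the bytes of a string as spaced hex pairs, plus a note on
 * whether the string is valid UTF-8. Hypothetical debugging helper.
 */
function describeBytes(string $value): string
{
    $hex = strtoupper(implode(' ', str_split(bin2hex($value), 2)));
    // preg_match() with an empty "u" pattern only succeeds on valid UTF-8.
    $valid = preg_match('//u', $value) === 1 ? 'valid UTF-8' : 'invalid UTF-8';
    return $hex . ' (' . $valid . ')';
}

echo describeBytes("\xC3\xA0"), PHP_EOL; // C3 A0 (valid UTF-8)
echo describeBytes("\xC3"), PHP_EOL;     // C3 (invalid UTF-8)
```

An intact "à" shows up as the two bytes C3 A0, whereas a clipped one shows up as a lone C3, which is what the database then refuses to store.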

Sorry, I'm easily confused when it comes to things like character sets etc. 😅

Edit: forgot to mention that based on StackOverflow this definitely looks like a character set issue, i.e. the typical case where this error occurs is when you're trying to store UTF-8 data in a latin1 table. Assuming that the CKEditor field in question is some form of UTF-8, the index field data column should definitely also be UTF-8, and if it's not, that sounds really weird.

Edited by teppo
On 2/6/2020 at 12:11 AM, Mikie said:

Cheers! I am storing magazine style story credits (role, name, website url etc) in the Table. I feel that since Table only accepts text based fields this is an ok candidate for indexing. Can try to hack away at your module myself for now, no rush.

Table field is now one of the supported fieldtypes in SearchEngine 0.17.0. The indexing part makes use of TableRows::render(); I may have to revisit this at some point, but this approach seemed to work quite well in my initial tests, and this way I don't have to identify each possible value but can rather let the fieldtype do all the heavy lifting 🙂

8 hours ago, teppo said:

Table field is now one of the supported fieldtypes in SearchEngine 0.17.0. The indexing part makes use of TableRows::render(); I may have to revisit this at some point, but this approach seemed to work quite well in my initial tests, and this way I don't have to identify each possible value but can rather let the fieldtype do all the heavy lifting 🙂

I've edited this reply, since I double-checked and it is happening both with and without entity encoders active on CKEditor fields when trying to save the search index. 

See the screenshots below, with the text Testing “testing” à 123 in a CKEditor field. The curly quotes get converted from HTML entities to UTF-8, but the UTF-8 "à" character gets clipped in half. I can confirm that PW, the database, the table and the column are all using utf8mb4 with a unicode_ci collation (learnt a lot about this stuff over the past few days!).

[Four screenshots attached to the original post]


Here is a dump after adding that Unicode-aware regex flag:

Sorry about the line numbers in the Tracy output, I had hacked your module to add tables 😏

[Screenshot attached to the original post]


Thanks, I'll try to dedicate a bit more time to this later today. I'm still confused as to why it's happening (I can only assume that there's some difference in the environment), but perhaps the "u" flag is indeed the correct fix. I will have to check that it doesn't cause additional issues in cases where the module is now working as expected... 🙂


Yeah, I wouldn't worry about it too much. If you can't replicate it on your end, it's definitely a problem with the environment. I will try to figure it out from my end.

12 minutes ago, Mikie said:

Here is the exact issue. setlocale() seems to be the problem when combined with preg_replace() on whitespace...

https://github.com/silverstripe/silverstripe-framework/issues/7132

 

7 minutes ago, Mikie said:

And it only happens on macOS as well! Wow, what an edge case.

Awesome — thanks for digging these out! 🙂

6 minutes ago, teppo said:

Awesome — thanks for digging these out! 🙂

No worries! I can confirm I had setlocale(LC_ALL, 'en_US.UTF-8'); in my site config. I only do this when PW tells me to and haven't taken the time to understand why. Turning that off also fixed the issue.

There's enough discussion within that SilverStripe GitHub issue about the alternatives. It's very much an edge case, so I will leave it up to you!
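To make the failure mode concrete, here is a sketch that simulates it on any platform. It rests on an assumption drawn from the linked issue rather than something verified here: that with a UTF-8 locale set on macOS, a byte-mode \s also matches the single byte 0xA0, which happens to be the second byte of "à". The pattern below adds \xA0 by hand to imitate that behaviour, so no setlocale() call is needed:

```php
<?php
// "à" written as its two UTF-8 bytes so the file encoding does not matter.
$text = "Testing \xC3\xA0 123";

// Simulation only: byte-mode pattern that treats 0xA0 as whitespace,
// assumed to mirror what the locale-specific tables do on macOS.
$broken = preg_replace('/[\s\xA0]+/', ' ', $text);

// Unicode-aware pattern: "à" is handled as one code point and kept whole.
$fixed = preg_replace('/\s+/u', ' ', $text);

echo bin2hex($broken), PHP_EOL; // the A0 byte is gone, a lone C3 remains
echo bin2hex($fixed), PHP_EOL;  // C3 A0 is still intact
```

In the simulated case the A0 byte is swallowed together with the following space, which leaves exactly the lone \xC3 described earlier in the thread.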


The aforementioned issue should be fixed now. As was already mentioned above, this could only be replicated under specific circumstances on macOS; nevertheless it seems that defining the "u" flag for preg_replace() is a relatively safe thing to do, so I've gone ahead and done that. If it ends up causing trouble, I may have to reconsider, but at least for now it seems to be all good 🙂

Thanks @Mikie for tracking this down!

