Unicode characters in selector


lenno

Hi,

Our ProcessWire website has a simple site search, which is implemented following the ProcessWire search demo code. It uses the %= operator to get broader results.

$queryResults = $this->pages->find('title%=应用'); // returns everything

The problem: it returns ALL pages. The reason is probably that %= uses SQL LIKE, and SQL needs to be told that the search string contains Unicode characters by adding an 'N' prefix. At least that is what I think I learned from: http://stackoverflow.com/questions/22156413/how-to-select-rows-with-chinese-japanese-characters
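For what it's worth, my understanding is that %= ends up as a plain LIKE query against the field's table on the MySQL side, so roughly something like this (illustration only, not our actual code; I'm assuming ProcessWire's usual field_title table with pages_id and data columns):

<?php
// Illustration only: roughly what I think title%=应用 boils down to in MySQL,
// assuming ProcessWire's usual field_title table with pages_id and data columns.
$stmt = $database->prepare("SELECT pages_id FROM field_title WHERE data LIKE :term");
$stmt->execute([':term' => '%应用%']);
foreach ($stmt->fetchAll(PDO::FETCH_COLUMN) as $pageId) {
    echo $pageId . PHP_EOL;
}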

I tried adding the N prefix, but I still get everything:

$queryResults = $this->pages->find("title%=N'应用'"); // still returns everything

I also tried encoding my Chinese text:

$queryResults = $this->pages->find('title%=\u5e94\u7528'); // no results

But now I get zero results. And by the way, the ~= operator also returns zero results:

$queryResults = $this->pages->find('title~=\u5e94\u7528'); // no results

$queryResults = $this->pages->find('title~=应用'); // also no results

I guess I could build my own SQL queries, but we have already built a very nice customized search based on ProcessWire selectors and would really like to stick with them.

Any idea how to solve this?

Many thanks, René

 


The Stack Overflow post relates to Microsoft SQL Server, not MySQL; the N'...' prefix is a T-SQL construct and doesn't apply here.

Searching for Unicode characters works fine in PW. If your characters get stripped, my bet is that they aren't valid UTF-8, which means the script itself is saved in another encoding and/or the webserver announces a different one.

I just tested by creating a page titled "你好 World". Running

<?php

foreach($pages->find('title%=你好') as $p) {
	echo $p->id . ": " . $p->title . PHP_EOL;
}

returned exactly that one entry, as it should. To check, save your test code as a script and make sure your editor saves it as UTF-8. Don't trust the command line. If that works, check the encoding headers sent by the webserver using the developer console (F12).
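If you'd rather rule out encoding problems directly in code, a quick sanity check could look like this (just a sketch, assuming the mbstring extension is available; replace the hardcoded term with whatever variable actually holds your search input):

<?php
// Sketch: quick checks for UTF-8 problems. Replace the hardcoded term
// with whatever variable actually holds the submitted search input.
header('Content-Type: text/html; charset=utf-8'); // serve the page explicitly as UTF-8

$q = '应用';

var_dump(mb_check_encoding($q, 'UTF-8'));     // should be bool(true) for valid UTF-8
var_dump(strlen($q), mb_strlen($q, 'UTF-8')); // 6 bytes vs. 2 characters for 应用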


Thanks BitPoet, I will try that tomorrow and report back!

Edit:

So that was unfortunate: for some reason I had some trailing spaces in the Chinese search query that messed everything up. Sorry for the confusion. I am now getting the expected results when using %=.
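In case anyone else runs into this, trimming (or sanitizing) the term before it goes into the selector avoids the problem; something like this (just a sketch, $q stands for whatever variable holds the submitted term):

<?php
// Sketch: strip stray whitespace and escape the term before building the selector.
// $q stands for whatever variable holds the submitted search term.
$q = trim($q);
$q = $sanitizer->selectorValue($q);
if (strlen($q)) {
    $queryResults = $pages->find("title%=$q");
}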

However, ~= still seems to return zero results. Could you try running your above code with the ~= selector? From my understanding, it should return the same results. But in my case, it returns nothing.


18 hours ago, lenno said:

However, ~= still seems to return zero results. Could you try running your above code with the ~= selector? From my understanding, it should return the same results. But in my case, it returns nothing.

That's because ~= uses fulltext indexes for natural language search, which are limited by MySQL's ft_min_word_len setting; it defaults to 4 characters (Unicode code points). PW's selector operator docs explain that. You could lower that setting, but be aware that MySQL doesn't support proper detection of word boundaries for Asian languages. Recent MySQL versions (>= 5.7.6) can let an ngram parser divide sequences of Unicode characters into groups (ngrams) of a configured length, but that is also quite fuzzy and may simply not give the results you want.
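If you want to see what your MySQL server is actually configured with, you can query the variables directly from a template; a quick sketch (ft_min_word_len applies to MyISAM fulltext indexes, innodb_ft_min_token_size to InnoDB ones):

<?php
// Sketch: show the fulltext minimum word/token length settings of the MySQL server.
$stmt = $database->query("SHOW VARIABLES WHERE Variable_name IN ('ft_min_word_len', 'innodb_ft_min_token_size')");
foreach ($stmt->fetchAll(PDO::FETCH_KEY_PAIR) as $name => $value) {
    echo $name . ' = ' . $value . PHP_EOL; // e.g. ft_min_word_len = 4
}

Keep in mind that lowering ft_min_word_len means changing the MySQL configuration and rebuilding the fulltext indexes afterwards.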

For these languages, the usual recommendation (even Oracle's) is to use a third-party search engine that comes with a dictionary-based preprocessor, e.g. the Sphinx engine.

