teppo Posted October 14, 2014 Share Posted October 14, 2014 This is just an idea I've been playing with a while: adding support for SOUNDS LIKE operator. I'm not sure how much sense it makes in the larger context, but I'm posting it here nevertheless in case it results in something useful As a bit of a background, in MySQL SOUNDS LIKE is just a shortcut for comparing two strings passed to SOUNDEX() and as Wikipedia kindly explains, "soundex is a phonetic algorithm for indexing names by sound, as pronounced in English". In other words, SOUNDEX() converts a string to an "index" of sorts, and comparing index values of two separate strings answers the question of "do these two sound alike (in English)". SOUNDS LIKE works best with English words, and literally words, since comparing entire sentences is often much less sensible (there's no native "wildcard soundex operator" in MySQL) -- names of people, products, countries, cities, etc. are a good example where one might benefit from SOUNDS LIKE. Cases I've really wished this would've been possible have included especially name and city searches, where people tend to mistype things a lot. Yesterday I decided that this would be a cool experiment, so I applied following hack to the DatabaseQuerySelectFulltext.php and Selector.php, making "€=" the SOUNDS LIKE operator. This is wrong at many levels, but works as a simple proof of concept: diff --git a/wire/core/DatabaseQuerySelectFulltext.php b/wire/core/DatabaseQuerySelectFulltext.php index 9c6064f..421f38a 100644 --- a/wire/core/DatabaseQuerySelectFulltext.php +++ b/wire/core/DatabaseQuerySelectFulltext.php @@ -83,6 +83,12 @@ class DatabaseQuerySelectFulltext extends Wire { $query->where("$tableField LIKE '%$v%'"); // SLOW, but assumed break; + case '€=': + $v = $database->escapeStr($value); + $v = $this->escapeLIKE($v); + $query->where("$tableField SOUNDS LIKE '$v'"); // SLOW, but assumed + break; + case '^=': case '%^=': // match at start using only LIKE (no index) $v = $database->escapeStr($value); diff --git a/wire/core/Selector.php b/wire/core/Selector.php index 31748f9..fccfe8a 100644 --- a/wire/core/Selector.php +++ b/wire/core/Selector.php @@ -204,6 +204,7 @@ abstract class Selector extends WireData { Selectors::addType(SelectorLessThanEqual::getOperator(), 'SelectorLessThanEqual'); Selectors::addType(SelectorContains::getOperator(), 'SelectorContains'); Selectors::addType(SelectorContainsLike::getOperator(), 'SelectorContainsLike'); + Selectors::addType(SelectorSoundsLike::getOperator(), 'SelectorSoundex'); Selectors::addType(SelectorContainsWords::getOperator(), 'SelectorContainsWords'); Selectors::addType(SelectorStarts::getOperator(), 'SelectorStarts'); Selectors::addType(SelectorStartsLike::getOperator(), 'SelectorStartsLike'); @@ -284,6 +285,10 @@ class SelectorContainsLike extends SelectorContains { public static function getOperator() { return '%='; } } +class SelectorSoundsLike extends SelectorContains { + public static function getOperator() { return '€='; } +} + /** * Selector that matches one string value that happens to have all of it's words present in another string value (regardless of individual word location) * Just for fun I've been testing this new operator with a script like this (and using one of the most mistyped names I've seen on this forum as a test subject), to see if it works at all: <?php require 'index.php'; foreach (wire('pages')->find('title€=antti') as $k => $p) { echo $k . ". " . $p->url . "\n"; } // results look something like this: // 0. /anttti/ // 1. /antti/ // 2. /anti/ // 3. /antii/ So, what do you folks think -- is there anything in this that might be worth taking further? Can you think of a better approach to this, such as a custom fieldtype with SOUNDS LIKE support, or something? Anything else? 10 2 Link to comment Share on other sites More sharing options...
sforsman Posted October 14, 2014 Share Posted October 14, 2014 I think it would be great to have pluggable selector operators and/or matchers supported directly in the core. If it did, such behavior - and others - could then be easily extended with community modules. For an example, if I preferred an UDF Levenshtein distance calculator instead of SOUNDEX to implement fuzzy searching, I could then do that. 8 Link to comment Share on other sites More sharing options...
DaveP Posted October 14, 2014 Share Posted October 14, 2014 mySQL fulltext searches 'WITH QUERY EXPANSION' are really fun/rad/useful. (@sforsman I didn't think mySQL had native support for Levenshtein distances without adding modules/triggers/routines etc? PHP does, so post-processing search results that way works.) 2 Link to comment Share on other sites More sharing options...
sforsman Posted October 14, 2014 Share Posted October 14, 2014 (@sforsman I didn't think mySQL had native support for Levenshtein distances without adding modules/triggers/routines etc? PHP does, so post-processing search results that way works.) Hey DaveP! Correct, it doesn't, but if you take a closer look, I was referring to a MySQL UDF. Stored procedures or post-processing with PHP would be awfully slow in comparison, which is the exact reason that forced me to write an UDF implementation back in the days. I can dig it up if you are interested in the subject. (Compiling and linking UDFs with MySQL isn't really as complex as they make it sound.) On a sidenote, Elasticsearch for an example - which is backed by Lucene - would obviously provide even greater speeds for fuzzy searching. However we were talking about something that can be combined with the native PW-selectors (i.e. something that can be sticked inside the resulting SQL query). The purpose of my example was to promote a generic way for hooking custom selector operators to existing matchers. 5 Link to comment Share on other sites More sharing options...
apeisa Posted October 14, 2014 Share Posted October 14, 2014 Teppo, great examples you are using. Next beer is on me! 1 Link to comment Share on other sites More sharing options...
Mats Posted November 10, 2014 Share Posted November 10, 2014 Textpattern has a nice plugin based on this code. Would love to see the same functionality in PW. I tried to make this in to a PW module but didn't come far. Link to comment Share on other sites More sharing options...
teppo Posted November 10, 2014 Author Share Posted November 10, 2014 @Mats, that might be an interesting project, but seems to have one major shortcoming. Since it's something you'd use in your templates (or in a module) instead of at the database, it's going to cause various performance issues; you'll have to fetch entire dataset before performing the fuzzy search, or iterate smaller result sets one by one. They even mention performance on the TXP plugin page as a "known issue" (though there might code-related causes for that too, not just the point at which the fuzzy search is triggered). In either case, I guess this is fine as long as you know you won't have to consider scalability, but for me that's a pretty big no-no ProcessWire does perform in-memory searches too, in which case this would make more sense, but at that point you'll be filtering a pre-fetched list of pages, which kind of misses the point of fuzzy search IMHO. 2 Link to comment Share on other sites More sharing options...
DaveP Posted November 10, 2014 Share Posted November 10, 2014 One approach that can work (and can scale) is using native mySQL - like soundex - to find a subset of results and then applying some php processing to them. For example, @teppo's original results in his opening post, sorted by Levenshtein distance ascending in php, would give very good results very quickly. And with a 'limit' in the selector, you can be sure of scalability. 1 Link to comment Share on other sites More sharing options...
gebeer Posted April 6, 2018 Share Posted April 6, 2018 Hello all, I'd like to revive this thread as I could use such functionality for a project with more than 2000 users round the world. When new users are added I'd like to perform a fuzzy search on text fields firstname and lastname to avoid accidental duplicate entries. @teppo thanks a lot for the code examples. Everything is basically there to perform soundex search. But how would one implement this today without hacking the core? Would that be possible in a module that extends classes Selector and DatabaseQuerySelectFulltext? Or an additional option to $pages->find(), e.g. ['fuzzy' =>true]? I'm not that experienced at OOP yet, so any pointers on how to approach this would be very much appreciated. Link to comment Share on other sites More sharing options...
BitPoet Posted April 6, 2018 Share Posted April 6, 2018 2 hours ago, gebeer said: When new users are added I'd like to perform a fuzzy search on text fields firstname and lastname to avoid accidental duplicate entries. Adds a soundex.field=something search option (so no custom operator) for database queries (so in your case, you'd use a selector like "soundex.firstname=$firstname, soundex.lastname=$lastname): <?php // Add to site/ready.php wire()->addHookAfter("PageFinder::getQueryUnknownField", function(HookEvent $event) { static $n = 0; $fieldName = $event->arguments(0); $data = $event->arguments(1); // We only handle our soundex queries if($fieldName != "soundex") return; if(empty($data["subfield"])) { throw new PageFinderSyntaxException("soundex selector expects the name of the field to search as subfield, e.g. 'soundex.title=something'"); } $selectors = $data["selectors"]; $selector = $data["selector"]; $query = $data["query"]; if(! $selector instanceof SelectorEqual) { throw new PageFinderSyntaxException("Invalid operator. Only '=' is allowed in soundex selectors"); } $subfield = "data"; $fieldName = $data["subfields"]; if(strpos($fieldName, '.') > 0) list($fieldName, $subfield) = explode('.', $fieldName); else $fieldName = $data["subfield"]; $field = wire('fields')->get($fieldName); if(!$field) { return; // does not resolve to a known field } // Initialize some needed variables $database = wire('database'); $table = $database->escapeTable($field->getTable()); $subfield = $database->escapeCol(wire('sanitizer')->fieldName($subfield)); $n++; $tableAlias = "_subquery_{$n}_$table"; $join = "$table AS $tableAlias ON $tableAlias.pages_id=pages.id"; $where = "$tableAlias.$subfield SOUNDS LIKE " . $database->quote($selector->value); $query = $data["query"]; $query->join($join); $query->where($where); $selectors->remove($selector); $event->return = true; }); 8 Link to comment Share on other sites More sharing options...
gebeer Posted April 6, 2018 Share Posted April 6, 2018 @BitPoet Thank you very much for the code! I put it in ready.php and my find logic inside a custom ProcessPageAdd::executeMembersExist hook looks like this $selector = 'template=member'; $firstname = $this->wire('sanitizer')->text($this->wire('input')->get('firstname')); $lastname = $this->wire('sanitizer')->text($this->wire('input')->get('lastname')); if($firstname) $selector .= ", soundex.firstname={$firstname}"; if($lastname) $selector .= ", soundex.lastname={$lastname}"; $selector .= ', include=all'; $users = $this->wire('users')->find($selector); I always get an error: Field does not exist: soundex Full backtrace: ProcessPageAdd: Field does not exist: soundex DEBUG MODE BACKTRACE ($config->debug == true): #0 /var/www/uhtinew.dev/wire/core/Wire.php(386): ProcessWire\PageFinder->___getQuery(Object(ProcessWire\Selectors), Array) #1 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___getQuery', Array) #2 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\PageFinder), 'getQuery', Array) #3 /var/www/uhtinew.dev/wire/core/PageFinder.php(391): ProcessWire\Wire->__call('getQuery', Array) #4 /var/www/uhtinew.dev/wire/core/PageFinder.php(391): ProcessWire\PageFinder->getQuery(Object(ProcessWire\Selectors), Array) #5 /var/www/uhtinew.dev/wire/core/Wire.php(386): ProcessWire\PageFinder->___find(Object(ProcessWire\Selectors), Array) #6 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___find', Array) #7 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\PageFinder), 'find', Array) #8 /var/www/uhtinew.dev/wire/core/PagesLoader.php(248): ProcessWire\Wire->__call('find', Array) #9 /var/www/uhtinew.dev/wire/core/PagesLoader.php(248): ProcessWire\PageFinder->find(Object(ProcessWire\Selectors), Array) #10 /var/www/uhtinew.dev/wire/core/Pages.php(232): ProcessWire\PagesLoader->find('template=member...', Array) #11 /var/www/uhtinew.dev/wire/core/Wire.php(386): ProcessWire\Pages->___find('template=member...', Array) #12 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___find', Array) #13 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\Pages), 'find', Array) #14 /var/www/uhtinew.dev/wire/core/PagesType.php(260): ProcessWire\Wire->__call('find', Array) #15 /var/www/uhtinew.dev/wire/core/PagesType.php(260): ProcessWire\Pages->find('template=member...', Array) #16 /var/www/uhtinew.dev/site/modules/ProcessUhtiPageAddMember/ProcessUhtiPageAddMember.module(231): ProcessWire\PagesType->find('template=member...') #17 /var/www/uhtinew.dev/wire/core/WireHooks.php(782): ProcessWire\ProcessUhtiPageAddMember->membersExist(Object(ProcessWire\HookEvent)) #18 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\ProcessPageAdd), 'executeMemberse...', Array) #19 /var/www/uhtinew.dev/wire/core/ProcessController.php(244): ProcessWire\Wire->__call('executeMemberse...', Array) #20 /var/www/uhtinew.dev/wire/core/ProcessController.php(244): ProcessWire\ProcessPageAdd->executeMembersexist() #21 /var/www/uhtinew.dev/wire/core/Wire.php(380): ProcessWire\ProcessController->___execute() #22 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___execute', Array) #23 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\ProcessController), 'execute', Array) #24 /var/www/uhtinew.dev/wire/core/admin.php(113): ProcessWire\Wire->__call('execute', Array) #25 /var/www/uhtinew.dev/wire/core/admin.php(113): ProcessWire\ProcessController->execute() #26 /var/www/uhtinew.dev/site/modules/AdminThemeCustom/controller.php(13): require('/var/www/uhtine...') #27 /var/www/uhtinew.dev/site/templates/admin.php(520): require('/var/www/uhtine...') #28 /var/www/uhtinew.dev/wire/core/TemplateFile.php(268): require('/var/www/uhtine...') #29 /var/www/uhtinew.dev/wire/core/Wire.php(380): ProcessWire\TemplateFile->___render() #30 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___render', Array) #31 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\TemplateFile), 'render', Array) #32 /var/www/uhtinew.dev/wire/modules/PageRender.module(514): ProcessWire\Wire->__call('render', Array) #33 /var/www/uhtinew.dev/wire/modules/PageRender.module(514): ProcessWire\TemplateFile->render() #34 /var/www/uhtinew.dev/wire/core/Wire.php(383): ProcessWire\PageRender->___renderPage(Object(ProcessWire\HookEvent)) #35 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___renderPage', Array) #36 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\PageRender), 'renderPage', Array) #37 /var/www/uhtinew.dev/wire/core/WireHooks.php(782): ProcessWire\Wire->__call('renderPage', Array) #38 /var/www/uhtinew.dev/wire/core/WireHooks.php(782): ProcessWire\PageRender->renderPage(Object(ProcessWire\HookEvent)) #39 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\Page), 'render', Array) #40 /var/www/uhtinew.dev/wire/modules/Process/ProcessPageView.module(205): ProcessWire\Wire->__call('render', Array) #41 /var/www/uhtinew.dev/wire/modules/Process/ProcessPageView.module(205): ProcessWire\Page->render() #42 /var/www/uhtinew.dev/wire/core/Wire.php(383): ProcessWire\ProcessPageView->___execute(true) #43 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___execute', Array) #44 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\ProcessPageView), 'execute', Array) #45 /var/www/uhtinew.dev/index.php(56): ProcessWire\Wire->__call('execute', Array) #46 /var/www/uhtinew.dev/index.php(56): ProcessWire\ProcessPageView->execute(true) #47 {main} It seems like the hook never gets called, although I can see it in Tracy Hooks Triggered I tried to add a low hook priority 10 but it doesn't make a difference. Also tried to add the hook in admin.php, since I only need it there. Same results. Do you have this working in one of your projects and what could be wrong here? Link to comment Share on other sites More sharing options...
BitPoet Posted April 6, 2018 Share Posted April 6, 2018 It's a quick&dirty modification of a selector extension I whipped up out of curiosity a while ago. I'll take a look at it over the weekend. 3 Link to comment Share on other sites More sharing options...
BitPoet Posted April 6, 2018 Share Posted April 6, 2018 Found the reason. getQueryUnknownField is far younger than I expected. It was added in preparation for the field.owner.subfield syntax in 3.0.87 which was officially introduced in 3.0.95. 3 Link to comment Share on other sites More sharing options...
Robin S Posted April 7, 2018 Share Posted April 7, 2018 A different approach which occurred to me is saving sounds-like data to hidden fields on a page and then searching those fields. This allows for other sounds-like algorithms such as metaphone, and allows for "contains" searches rather than only "equals" searches. Notes on the code that follows: You would create hidden fields "metaphones" and "soundex" and then add those to templates as needed. In the example I just save sounds-like data for the title field, but you could include other fields also. Using Double Metaphone would give more accurate results, but I just used metaphone in the example for simplicity. In the search code I am only searching the sounds-like data, but in the real world you would include other fields in the selector also as the "normal" part of the search. In /site/ready.php: $pages->addHookAfter('saveReady', function(HookEvent $event) { $page = $event->arguments(0); if(!$page->id || !($page->template->hasField('metaphones') && $page->template->hasField('soundex'))) return; $metaphones = ''; $soundex = ''; $words = explode(' ', $page->title); // Get the individual words of field(s) foreach($words as $word) { if(strlen($word) < 3) continue; // Ignore short words $metaphones .= metaphone($word) . ' '; $soundex .= soundex($word) . ' '; } $page->metaphones = $metaphones; $page->soundex = $soundex; }); In search template file: // $q is the sanitized search string $words = explode(' ', $q); $metaphones = ''; $soundex = ''; foreach($words as $word) { if(strlen($word) < 3) continue; // Ignore short words $metaphones .= metaphone($word) . ' '; $soundex .= soundex($word) . ' '; } $selector = "(metaphones~=$metaphones), "; $selector .= "(soundex~=$soundex)"; $results = $pages->find($selector); This allows matching a page title of "The quick brown fox jumps over the lazy dog" by search strings "offer took" and "quiz fogs" thanks to the differences between metaphone and soundex. Don't expect too much from it though - I found plenty of soundalike words that didn't match. The general principle could be expanded with other algorithms, and it would be cool to enhance this by allowing for mixed matches - for example, where in a two word search one word matches metaphone and the second word matches soundex. Edit: on that last point, a simple way would be to use just a single hidden field for both the metaphone and soundex data. The data from those two algorithms is sufficiently different that unwanted matches wouldn't happen. But if other algorithms were added you'd have to check to make sure the data from one algorithm wouldn't be confused with that of another. 2 Link to comment Share on other sites More sharing options...
Robin S Posted April 7, 2018 Share Posted April 7, 2018 (edited) Here's an example with a single "sounds_like" field containing both metaphone and soundex data. This is better I think. In /site/ready.php: $pages->addHookAfter('saveReady', function(HookEvent $event) { $page = $event->arguments(0); if(!$page->id || !$page->template->hasField('sounds_like')) return; $sounds_like = ''; $words = explode(' ', $page->title); // Get the individual words of field(s) foreach($words as $word) { if(strlen($word) < 3) continue; // Ignore short words $sounds_like .= metaphone($word) . ' ' . soundex($word) . ' '; } $page->sounds_like = $sounds_like; } In search template file: // $q is the sanitized search string $words = explode(' ', $q); $selector = ''; foreach($words as $word) { if(strlen($word) < 3) continue; // Ignore short words $selector .= 'sounds_like~=' . metaphone($word) . '|' . soundex($word) . ', '; } $results = $pages->find($selector); Edited April 7, 2018 by Robin S Avoid second loop in search template. 3 Link to comment Share on other sites More sharing options...
gebeer Posted April 7, 2018 Share Posted April 7, 2018 10 hours ago, BitPoet said: Found the reason. getQueryUnknownField is far younger than I expected. It was added in preparation for the field.owner.subfield syntax in 3.0.87 which was officially introduced in 3.0.95. Thanks for that hint. My site is running on 3.0.62. That explains why it is not working for me. Will look at upgrading. @Robin S Interesting approach, thank you! Would this add significant performance overhead with using soundex() and metaphone() on the PHP side rather than in MySQL? Link to comment Share on other sites More sharing options...
Robin S Posted April 7, 2018 Share Posted April 7, 2018 1 hour ago, gebeer said: Would this add significant performance overhead with using soundex() and metaphone() on the PHP side rather than in MySQL? I shouldn't think so. The overhead in the search template would be negligible because searches are usually only for a few words at most. In the saveReady hook it would depend on how much content in the page you are saving sounds-like data for. I can't see it being an issue in most circumstances. 2 Link to comment Share on other sites More sharing options...
Recommended Posts
Create an account or sign in to comment
You need to be a member in order to leave a comment
Create an account
Sign up for a new account in our community. It's easy!
Register a new accountSign in
Already have an account? Sign in here.
Sign In Now