Jump to content

SOUNDS LIKE / SOUNDEX() operator


teppo
 Share

Recommended Posts

This is just an idea I've been playing with a while: adding support for SOUNDS LIKE operator. I'm not sure how much sense it makes in the larger context, but I'm posting it here nevertheless in case it results in something useful :)
 
As a bit of a background, in MySQL SOUNDS LIKE is just a shortcut for comparing two strings passed to SOUNDEX() and as Wikipedia kindly explains, "soundex is a phonetic algorithm  for indexing names by sound, as pronounced in English". In other words, SOUNDEX() converts a string to an "index" of sorts, and comparing index values of two separate strings answers the question of "do these two sound alike (in English)".
 
SOUNDS LIKE works best with English words, and literally words, since comparing entire sentences is often much less sensible (there's no native "wildcard soundex operator" in MySQL) -- names of people, products, countries, cities, etc. are a good example where one might benefit from SOUNDS LIKE. Cases I've really wished this would've been possible have included especially name and city searches, where people tend to mistype things a lot.
 
Yesterday I decided that this would be a cool experiment, so I applied following hack to the DatabaseQuerySelectFulltext.php and Selector.php, making "€=" the SOUNDS LIKE operator. This is wrong at many levels, but works as a simple proof of concept:

diff --git a/wire/core/DatabaseQuerySelectFulltext.php b/wire/core/DatabaseQuerySelectFulltext.php
index 9c6064f..421f38a 100644
--- a/wire/core/DatabaseQuerySelectFulltext.php
+++ b/wire/core/DatabaseQuerySelectFulltext.php
@@ -83,6 +83,12 @@ class DatabaseQuerySelectFulltext extends Wire {
                                $query->where("$tableField LIKE '%$v%'"); // SLOW, but assumed
                                break;
 
+                       case '€=':
+                               $v = $database->escapeStr($value); 
+                               $v = $this->escapeLIKE($v); 
+                               $query->where("$tableField SOUNDS LIKE '$v'"); // SLOW, but assumed
+                               break;
+
                        case '^=':
                        case '%^=': // match at start using only LIKE (no index)
                                $v = $database->escapeStr($value);
diff --git a/wire/core/Selector.php b/wire/core/Selector.php
index 31748f9..fccfe8a 100644
--- a/wire/core/Selector.php
+++ b/wire/core/Selector.php
@@ -204,6 +204,7 @@ abstract class Selector extends WireData {
                Selectors::addType(SelectorLessThanEqual::getOperator(), 'SelectorLessThanEqual'); 
                Selectors::addType(SelectorContains::getOperator(), 'SelectorContains'); 
                Selectors::addType(SelectorContainsLike::getOperator(), 'SelectorContainsLike'); 
+               Selectors::addType(SelectorSoundsLike::getOperator(), 'SelectorSoundex'); 
                Selectors::addType(SelectorContainsWords::getOperator(), 'SelectorContainsWords'); 
                Selectors::addType(SelectorStarts::getOperator(), 'SelectorStarts'); 
                Selectors::addType(SelectorStartsLike::getOperator(), 'SelectorStartsLike'); 
@@ -284,6 +285,10 @@ class SelectorContainsLike extends SelectorContains {
        public static function getOperator() { return '%='; }
 }
 
+class SelectorSoundsLike extends SelectorContains { 
+       public static function getOperator() { return '€='; }
+}
+
 /**
  * Selector that matches one string value that happens to have all of it's words present in another string value (regardless of individual word location)
  *

Just for fun I've been testing this new operator with a script like this (and using one of the most mistyped names I've seen on this forum as a test subject), to see if it works at all:

<?php
require 'index.php';
foreach (wire('pages')->find('title€=antti') as $k => $p) {
    echo $k . ". " . $p->url . "\n";
}

// results look something like this:
// 0. /anttti/
// 1. /antti/
// 2. /anti/
// 3. /antii/

So, what do you folks think -- is there anything in this that might be worth taking further? Can you think of a better approach to this, such as a custom fieldtype with SOUNDS LIKE support, or something? Anything else? :)

  • Like 10
  • Thanks 2
Link to comment
Share on other sites

I think it would be great to have pluggable selector operators and/or matchers supported directly in the core. If it did, such behavior - and others - could then be easily extended with community modules. For an example, if I preferred an UDF Levenshtein distance calculator instead of SOUNDEX to implement fuzzy searching, I could then do that.

  • Like 8
Link to comment
Share on other sites

(@sforsman I didn't think mySQL had native support for Levenshtein distances without adding modules/triggers/routines etc? PHP does, so post-processing search results that way works.)

Hey DaveP! Correct, it doesn't, but if you take a closer look, I was referring to a MySQL UDF. Stored procedures or post-processing with PHP would be awfully slow in comparison, which is the exact reason that forced me to write an UDF implementation back in the days. I can dig it up if you are interested in the subject. (Compiling and linking UDFs with MySQL isn't really as complex as they make it sound.)

On a sidenote, Elasticsearch for an example - which is backed by Lucene - would obviously provide even greater speeds for fuzzy searching. However we were talking about something that can be combined with the native PW-selectors (i.e. something that can be sticked inside the resulting SQL query). The purpose of my example was to promote a generic way for hooking custom selector operators to existing matchers.

  • Like 5
Link to comment
Share on other sites

  • 4 weeks later...

@Mats, that might be an interesting project, but seems to have one major shortcoming. Since it's something you'd use in your templates (or in a module) instead of at the database, it's going to cause various performance issues; you'll have to fetch entire dataset before performing the fuzzy search, or iterate smaller result sets one by one.

They even mention performance on the TXP plugin page as a "known issue" (though there might code-related causes for that too, not just the point at which the fuzzy search is triggered). In either case, I guess this is fine as long as you know you won't have to consider scalability, but for me that's a pretty big no-no  :)

ProcessWire does perform in-memory searches too, in which case this would make more sense, but at that point you'll be filtering a pre-fetched list of pages, which kind of misses the point of fuzzy search IMHO.

  • Like 2
Link to comment
Share on other sites

One approach that can work (and can scale) is using native mySQL - like soundex - to find a subset of results and then applying some php processing to them.

For example, @teppo's original results in his opening post, sorted by Levenshtein distance ascending in php, would give very good results very quickly. And with a 'limit' in the selector, you can be sure of scalability.

  • Like 1
Link to comment
Share on other sites

  • 3 years later...

Hello all,

I'd like to revive this thread as I could use such functionality for a project with more than 2000 users round the world. When new users are added I'd like to perform a fuzzy search on text fields firstname and lastname to avoid accidental duplicate entries.


@teppo thanks a lot for the code examples. Everything is basically there to perform soundex search.

But how would one implement this today without hacking the core? Would that be possible in a module that extends classes Selector and DatabaseQuerySelectFulltext? Or an additional option to $pages->find(), e.g. ['fuzzy' =>true]?

I'm not that experienced at OOP yet, so any pointers on how to approach this would be very much appreciated.

Link to comment
Share on other sites

2 hours ago, gebeer said:

When new users are added I'd like to perform a fuzzy search on text fields firstname and lastname to avoid accidental duplicate entries.

Adds a soundex.field=something search option (so no custom operator) for database queries (so in your case, you'd use a selector like "soundex.firstname=$firstname, soundex.lastname=$lastname):

<?php

// Add to site/ready.php

wire()->addHookAfter("PageFinder::getQueryUnknownField", function(HookEvent $event) {
	static $n = 0;
	
	$fieldName = $event->arguments(0);
	$data = $event->arguments(1);

	// We only handle our soundex queries
	if($fieldName != "soundex") return;
		
	if(empty($data["subfield"])) {
		throw new PageFinderSyntaxException("soundex selector expects the name of the field to search as subfield, e.g. 'soundex.title=something'");
	}
	
	$selectors = $data["selectors"];
	$selector = $data["selector"];
	$query = $data["query"];
	
	if(! $selector instanceof SelectorEqual) {
		throw new PageFinderSyntaxException("Invalid operator. Only '=' is allowed in soundex selectors");
	}
	
	$subfield = "data";
	$fieldName = $data["subfields"];
	if(strpos($fieldName, '.') > 0) list($fieldName, $subfield) = explode('.', $fieldName);
	else $fieldName = $data["subfield"];

	$field = wire('fields')->get($fieldName);
	if(!$field) {
		return; // does not resolve to a known field
	}

	// Initialize some needed variables
	$database = wire('database');
	
	$table = $database->escapeTable($field->getTable());
	$subfield = $database->escapeCol(wire('sanitizer')->fieldName($subfield));
	$n++;
	$tableAlias = "_subquery_{$n}_$table";

	$join = "$table AS $tableAlias ON $tableAlias.pages_id=pages.id";
	$where = "$tableAlias.$subfield SOUNDS LIKE " . $database->quote($selector->value);

	$query = $data["query"];
	$query->join($join);
	$query->where($where);
	
	$selectors->remove($selector);

	$event->return = true;
});

 

  • Like 8
Link to comment
Share on other sites

@BitPoet Thank you very much for the code!

I put it in ready.php and my find logic inside a custom ProcessPageAdd::executeMembersExist hook looks like this

		$selector = 'template=member';
		$firstname = $this->wire('sanitizer')->text($this->wire('input')->get('firstname')); 
		$lastname = $this->wire('sanitizer')->text($this->wire('input')->get('lastname')); 
		if($firstname) $selector .= ", soundex.firstname={$firstname}";
		if($lastname) $selector .= ", soundex.lastname={$lastname}";
		$selector .= ', include=all';
		$users = $this->wire('users')->find($selector);

I always get an error: Field does not exist: soundex

Full backtrace:

ProcessPageAdd: Field does not exist: soundex

DEBUG MODE BACKTRACE ($config->debug == true):
#0 /var/www/uhtinew.dev/wire/core/Wire.php(386): ProcessWire\PageFinder->___getQuery(Object(ProcessWire\Selectors), Array)
#1 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___getQuery', Array)
#2 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\PageFinder), 'getQuery', Array)
#3 /var/www/uhtinew.dev/wire/core/PageFinder.php(391): ProcessWire\Wire->__call('getQuery', Array)
#4 /var/www/uhtinew.dev/wire/core/PageFinder.php(391): ProcessWire\PageFinder->getQuery(Object(ProcessWire\Selectors), Array)
#5 /var/www/uhtinew.dev/wire/core/Wire.php(386): ProcessWire\PageFinder->___find(Object(ProcessWire\Selectors), Array)
#6 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___find', Array)
#7 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\PageFinder), 'find', Array)
#8 /var/www/uhtinew.dev/wire/core/PagesLoader.php(248): ProcessWire\Wire->__call('find', Array)
#9 /var/www/uhtinew.dev/wire/core/PagesLoader.php(248): ProcessWire\PageFinder->find(Object(ProcessWire\Selectors), Array)
#10 /var/www/uhtinew.dev/wire/core/Pages.php(232): ProcessWire\PagesLoader->find('template=member...', Array)
#11 /var/www/uhtinew.dev/wire/core/Wire.php(386): ProcessWire\Pages->___find('template=member...', Array)
#12 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___find', Array)
#13 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\Pages), 'find', Array)
#14 /var/www/uhtinew.dev/wire/core/PagesType.php(260): ProcessWire\Wire->__call('find', Array)
#15 /var/www/uhtinew.dev/wire/core/PagesType.php(260): ProcessWire\Pages->find('template=member...', Array)
#16 /var/www/uhtinew.dev/site/modules/ProcessUhtiPageAddMember/ProcessUhtiPageAddMember.module(231): ProcessWire\PagesType->find('template=member...')
#17 /var/www/uhtinew.dev/wire/core/WireHooks.php(782): ProcessWire\ProcessUhtiPageAddMember->membersExist(Object(ProcessWire\HookEvent))
#18 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\ProcessPageAdd), 'executeMemberse...', Array)
#19 /var/www/uhtinew.dev/wire/core/ProcessController.php(244): ProcessWire\Wire->__call('executeMemberse...', Array)
#20 /var/www/uhtinew.dev/wire/core/ProcessController.php(244): ProcessWire\ProcessPageAdd->executeMembersexist()
#21 /var/www/uhtinew.dev/wire/core/Wire.php(380): ProcessWire\ProcessController->___execute()
#22 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___execute', Array)
#23 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\ProcessController), 'execute', Array)
#24 /var/www/uhtinew.dev/wire/core/admin.php(113): ProcessWire\Wire->__call('execute', Array)
#25 /var/www/uhtinew.dev/wire/core/admin.php(113): ProcessWire\ProcessController->execute()
#26 /var/www/uhtinew.dev/site/modules/AdminThemeCustom/controller.php(13): require('/var/www/uhtine...')
#27 /var/www/uhtinew.dev/site/templates/admin.php(520): require('/var/www/uhtine...')
#28 /var/www/uhtinew.dev/wire/core/TemplateFile.php(268): require('/var/www/uhtine...')
#29 /var/www/uhtinew.dev/wire/core/Wire.php(380): ProcessWire\TemplateFile->___render()
#30 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___render', Array)
#31 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\TemplateFile), 'render', Array)
#32 /var/www/uhtinew.dev/wire/modules/PageRender.module(514): ProcessWire\Wire->__call('render', Array)
#33 /var/www/uhtinew.dev/wire/modules/PageRender.module(514): ProcessWire\TemplateFile->render()
#34 /var/www/uhtinew.dev/wire/core/Wire.php(383): ProcessWire\PageRender->___renderPage(Object(ProcessWire\HookEvent))
#35 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___renderPage', Array)
#36 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\PageRender), 'renderPage', Array)
#37 /var/www/uhtinew.dev/wire/core/WireHooks.php(782): ProcessWire\Wire->__call('renderPage', Array)
#38 /var/www/uhtinew.dev/wire/core/WireHooks.php(782): ProcessWire\PageRender->renderPage(Object(ProcessWire\HookEvent))
#39 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\Page), 'render', Array)
#40 /var/www/uhtinew.dev/wire/modules/Process/ProcessPageView.module(205): ProcessWire\Wire->__call('render', Array)
#41 /var/www/uhtinew.dev/wire/modules/Process/ProcessPageView.module(205): ProcessWire\Page->render()
#42 /var/www/uhtinew.dev/wire/core/Wire.php(383): ProcessWire\ProcessPageView->___execute(true)
#43 /var/www/uhtinew.dev/wire/core/WireHooks.php(698): ProcessWire\Wire->_callMethod('___execute', Array)
#44 /var/www/uhtinew.dev/wire/core/Wire.php(442): ProcessWire\WireHooks->runHooks(Object(ProcessWire\ProcessPageView), 'execute', Array)
#45 /var/www/uhtinew.dev/index.php(56): ProcessWire\Wire->__call('execute', Array)
#46 /var/www/uhtinew.dev/index.php(56): ProcessWire\ProcessPageView->execute(true)
#47 {main}

It seems like the hook never gets called, although I can see it in Tracy Hooks Triggered

5ac7184de2c1c_2018-04-06-134821.png.ae60bd6a80a13241dba2d723ef9d54dd.png

I tried to add a low hook priority 10 but it doesn't make a difference.

Also tried to add the hook in admin.php, since I only need it there. Same results.

Do you have this working in one of your projects and what could be wrong here?

Link to comment
Share on other sites

A different approach which occurred to me is saving sounds-like data to hidden fields on a page and then searching those fields. This allows for other sounds-like algorithms such as metaphone, and allows for "contains" searches rather than only "equals" searches.

Notes on the code that follows:

  • You would create hidden fields "metaphones" and "soundex" and then add those to templates as needed.
  • In the example I just save sounds-like data for the title field, but you could include other fields also.
  • Using Double Metaphone would give more accurate results, but I just used metaphone in the example for simplicity.
  • In the search code I am only searching the sounds-like data, but in the real world you would include other fields in the selector also as the "normal" part of the search.

 

In /site/ready.php:

$pages->addHookAfter('saveReady', function(HookEvent $event) {
    $page = $event->arguments(0);
    if(!$page->id || !($page->template->hasField('metaphones') && $page->template->hasField('soundex'))) return;
    $metaphones = '';
    $soundex = '';
    $words = explode(' ', $page->title); // Get the individual words of field(s)
    foreach($words as $word) {
        if(strlen($word) < 3) continue; // Ignore short words
        $metaphones .= metaphone($word) . ' ';
        $soundex .= soundex($word) . ' ';
    }
    $page->metaphones = $metaphones;
    $page->soundex = $soundex;
});

In search template file:

// $q is the sanitized search string
$words = explode(' ', $q);
$metaphones = '';
$soundex = '';
foreach($words as $word) {
    if(strlen($word) < 3) continue; // Ignore short words
    $metaphones .= metaphone($word) . ' ';
    $soundex .= soundex($word) . ' ';
}
$selector = "(metaphones~=$metaphones), ";
$selector .= "(soundex~=$soundex)";
$results = $pages->find($selector);

 

This allows matching a page title of "The quick brown fox jumps over the lazy dog" by search strings "offer took" and "quiz fogs" thanks to the differences between metaphone and soundex. Don't expect too much from it though - I found plenty of soundalike words that didn't match.

The general principle could be expanded with other algorithms, and it would be cool to enhance this by allowing for mixed matches - for example, where in a two word search one word matches metaphone and the second word matches soundex. Edit: on that last point, a simple way would be to use just a single hidden field for both the metaphone and soundex data. The data from those two algorithms is sufficiently different that unwanted matches wouldn't happen. But if other algorithms were added you'd have to check to make sure the data from one algorithm wouldn't be confused with that of another.

  • Like 2
Link to comment
Share on other sites

Here's an example with a single "sounds_like" field containing both metaphone and soundex data. This is better I think.

In /site/ready.php:

$pages->addHookAfter('saveReady', function(HookEvent $event) {
    $page = $event->arguments(0);
    if(!$page->id || !$page->template->hasField('sounds_like')) return;
    $sounds_like = '';
    $words = explode(' ', $page->title); // Get the individual words of field(s)
    foreach($words as $word) {
        if(strlen($word) < 3) continue; // Ignore short words
        $sounds_like .= metaphone($word) . ' ' . soundex($word) . ' ';
    }
    $page->sounds_like = $sounds_like;
}

In search template file:

// $q is the sanitized search string
$words = explode(' ', $q);
$selector = '';
foreach($words as $word) {
    if(strlen($word) < 3) continue; // Ignore short words
    $selector .= 'sounds_like~=' .  metaphone($word) . '|' . soundex($word) . ', ';
}
$results = $pages->find($selector);

 

Edited by Robin S
Avoid second loop in search template.
  • Like 3
Link to comment
Share on other sites

10 hours ago, BitPoet said:

Found the reason. getQueryUnknownField is far younger than I expected. It was added in preparation for the field.owner.subfield syntax in 3.0.87 which was officially introduced in 3.0.95.

Thanks for that hint. My site is running on 3.0.62. That explains why it is not working for me. Will look at upgrading.

@Robin S Interesting approach, thank you! Would this add significant performance overhead with using  soundex() and metaphone() on the PHP side rather than in MySQL?

Link to comment
Share on other sites

1 hour ago, gebeer said:

Would this add significant performance overhead with using  soundex() and metaphone() on the PHP side rather than in MySQL?

I shouldn't think so. The overhead in the search template would be negligible because searches are usually only for a few words at most. In the saveReady hook it would depend on how much content in the page you are saving sounds-like data for. I can't see it being an issue in most circumstances.

  • Like 2
Link to comment
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now
 Share

  • Recently Browsing   0 members

    • No registered users viewing this page.
×
×
  • Create New...