Weekly update – 3 July 2020


ryan

This week I'm not bumping the version number just yet because I've got lots of work in progress. The biggest thing so far is something I hinted at last week. Basically, I like what the addition of the MySQL query expansion operators has brought (per posts last week and the week before), but they also reveal what's lacking: something as simple as a search for "books" still can't directly match the word "book". And that's just the most basic example. It's not a limitation of ProcessWire, but of this type of database index in general.

I think it'd be amazing if ProcessWire had the ability of being really smart about this stuff, able to interpolate not just plurals vs. singulars but related words. In a perfect world, this is what query expansion would do (in addition to what it already does). But the reality is that it involves all kinds of complicated logic, rules and dictionaries; well beyond the scope of even a database. And it can be vastly different depending on the language. So this isn't something we can just add to the core and have it work. 

On the other hand, I figured maybe we should just put in a hookable method that just pretends the ability was there. Then people could hook it and make it respond with variations of words, according to their needs. The searches that use query expansion could then call this method and use whatever it returns... for when someday the ability is there. 

So I went ahead and added that hook: WireTextTools::wordAlternates(). And our database-searching class (DatabaseQuerySelectFulltext) now calls upon it, just in case an implementation is available.
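To make the idea concrete, here is a minimal sketch in Python of a "pretend the ability is there" method. It is only an illustration of the pattern, not ProcessWire's actual hook system or the real signature of WireTextTools::wordAlternates(). The names WordAlternates, naive_plurals and expand are all made up for this example.

```python
from typing import Callable, List


class WordAlternates:
    """Hypothetical stand-in for a hookable method; not ProcessWire's API."""

    def __init__(self) -> None:
        self._hooks: List[Callable[[str], List[str]]] = []

    def hook(self, fn: Callable[[str], List[str]]) -> None:
        # Register a function that returns alternate words for a given word.
        self._hooks.append(fn)

    def __call__(self, word: str) -> List[str]:
        # With no hooks registered this returns an empty list,
        # so callers can always call it safely.
        word = word.lower()
        found: List[str] = []
        for fn in self._hooks:
            for alt in fn(word):
                alt = alt.lower()
                if alt != word and alt not in found:
                    found.append(alt)
        return found


def naive_plurals(word: str) -> List[str]:
    # Deliberately crude English rule, for illustration only.
    if word.endswith("s") and len(word) > 3:
        return [word[:-1]]
    return [word + "s"]


def expand(words: List[str], alternates: WordAlternates) -> List[str]:
    # Original words come first, followed by whatever the hooks contribute.
    out = list(words)
    for w in words:
        out += [a for a in alternates(w) if a not in out]
    return out
```

Until something is hooked in, expand() simply returns the words it was given. Once a hook such as naive_plurals is registered, a search for "books" also picks up "book". The crude rule also produces non-words like "shelfs", which is exactly why a proper dictionary-based implementation is the better thing to hook in.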

Well, after getting that hook added and having our class call it, naturally I wanted to test it out. So I got to work on it and came up with this module: WireWordTools

The WireWordTools module provides an API for English word inflection and lemmatisation. And it hooks that new method mentioned above, so that you can install it and immediately have it bring your searches to the next level. While it only helps for English-language searches, maybe we'll be able to add more languages to it, or maybe it'll lead to other modules that do the same thing for other languages. 
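As a rough illustration of what dictionary-based lemmatisation and inflection lookups involve, here is a small Python sketch. It assumes a plain mapping from inflected form to lemma. The sample words and the function names lemma, build_inflections and alternates are made up for this example and are not the module's API.

```python
from collections import defaultdict
from typing import Dict, List

# Tiny illustrative sample; a real dictionary would be loaded from data files.
LEMMAS: Dict[str, str] = {
    "books": "book",
    "booked": "book",
    "booking": "book",
    "mice": "mouse",
}


def lemma(word: str) -> str:
    # Lemmatisation: map an inflected form back to its base form.
    word = word.lower()
    return LEMMAS.get(word, word)


def build_inflections(lemmas: Dict[str, str]) -> Dict[str, List[str]]:
    # Inflection: reverse the mapping so a base form lists its variants.
    index = defaultdict(list)
    for inflected, base in sorted(lemmas.items()):
        index[base].append(inflected)
    return dict(index)


INFLECTIONS = build_inflections(LEMMAS)


def alternates(word: str) -> List[str]:
    # All known forms related to the word, excluding the word itself.
    base = lemma(word)
    forms = [base] + INFLECTIONS.get(base, [])
    return [f for f in forms if f != word.lower()]
```

With this sample data, alternates("books") gives "book", "booked" and "booking", and a word that isn't in the dictionary gives nothing. Irregular forms like "mice" work only because they are listed explicitly, which a simple suffix rule can't do.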

The expanded/alternate words are only used for searches that use the new query expansion operators, which are the ones that have a "+" in them: ~+=, ~|+=, *+=, **+=. They can all return similar results, but they weight them differently. Unlike most operators, where the logic is direct and you can expect them to always behave the same way, these query expansion operators are more subjective, and I think we should keep tweaking them over time to improve the quality of the results they return. Basically, they are geared towards building site search engines, so I think it makes sense for us to pursue anything that makes them better at that, rather than aiming to always have them return the same thing. I am currently testing out the ~|+= operator ("contains any words expand") on our main site search engine here, along with the WireWordTools module. Finally, searching for "books" does match "book" too, and a lot more. More to be done here, but it's a good start hopefully.


@bernhard I didn't come up with the dictionary words in the JSON files, they are converted from an existing one (here) and apparently the original source is wordnet.princeton.edu. So I'm not sure if those particular words are intended or mistakes. New to me, but "wa" and "wo" are actual English words. Though as far as I can tell they aren't related to "was" or "will". I can't imagine those two instances will ever be helpful for our intended use case so maybe it makes sense to remove them. My plan was to keep looking for more existing dictionaries and continue to merge them into the one in WireWordTools so that it becomes more comprehensive over time. 


On 7/3/2020 at 11:43 PM, ryan said:

Basically, I like what the addition of the MySQL query expansion operators has brought (per posts last week and the week before), but they also reveal what's lacking: something as simple as a search for "books" still can't directly match the word "book". And that's just the most basic example. It's not a limitation of ProcessWire, but of this type of database index in general.

This is where postgresql outshines mysql by far. It can do stemming and accent support for multiple languages out of the box, ranking, fuzzy search, … which are the things you found missing. Many people/resources seem to suggest not bothering with mysql for advanced full text search needs, but going directly to purpose-built external technologies for that, while postgresql provides a stepping stone before needing to go that route.


Quote

It can do stemming and accent support for multiple languages out of the box, ranking, fuzzy search, … which are the things you found missing. 

Good article and postgresql looks interesting with its search capabilities, thanks. Though none of these really solve what I was after here. I experimented quite a bit with stemming and different stemming libraries, though they all did roughly the same thing. When it came to searching, stemming just wasn’t that useful. WireWordTools originally had a stemming library and methods, and the appropriate fulltext queries included the word stems with wildcards. In the end, it just wasn’t helpful most of the time. And in the few cases where it was worthwhile, it was redundant with, though far less thorough than, what we already had with inflection and lemmatisation. So while stemming can have its uses, it’s not even halfway there if you are trying to build a smart search. Cool nevertheless that they apparently have it built in.

As far as accent support, ranking and fuzzy search, these are all things that MySQL does as well, though maybe there are differences in how they do them. For instance, MySQL supports “sounds like” and also supports pluggable parsers for fulltext searches. Fuzzy search also isn't what I'm after here, but I'm certainly interested in exploring it in the future.

For me the most useful thing by far is boolean mode searches, particularly in InnoDB, which has a full-text engine modeled on Sphinx. Boolean mode searches are really very powerful, enabling you to specify what’s required, what’s excluded, matching of words or phrases, partial matching of words with wildcards, specifying noise words, isolating distance between words, adjusting ranking up or down on a per-word basis, grouped expressions and nested subexpressions. All while being incredibly fast. I’m pretty thrilled with what MySQL supports here and what it brings to ProcessWire. Postgresql looks very nice too, but for our needs, I don’t feel we are lacking anything relative to it. 
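For anyone who hasn't used boolean mode directly, here is a small Python sketch that assembles a boolean-mode search expression from a few of the operators mentioned above: "+" for required words, "-" for excluded words, a trailing "*" for partial matches, and double quotes for phrases. The helper names are made up for this example and the cleaning step is deliberately simplistic. It is not how ProcessWire builds its queries.

```python
import re
from typing import Iterable


def _clean(word: str) -> str:
    # Drop characters that boolean mode treats as operators.
    # Simplistic: "well-known" becomes "wellknown".
    return re.sub(r'[+\-<>()~*"@]', "", word).strip()


def boolean_query(
    required: Iterable[str] = (),
    excluded: Iterable[str] = (),
    prefixes: Iterable[str] = (),
    phrases: Iterable[str] = (),
) -> str:
    parts = []
    parts += ["+" + _clean(w) for w in required if _clean(w)]   # must be present
    parts += ["-" + _clean(w) for w in excluded if _clean(w)]   # must be absent
    parts += [_clean(w) + "*" for w in prefixes if _clean(w)]   # partial word match
    parts += ['"' + _clean(p) + '"' for p in phrases if _clean(p)]  # exact phrase
    return " ".join(parts)


# The resulting string would be passed as a bound parameter, e.g. to
#   MATCH(column) AGAINST(? IN BOOLEAN MODE)
query = boolean_query(
    required=["book"],
    excluded=["draft"],
    prefixes=["librar"],
    phrases=["site search"],
)
```

Here query ends up as: +book -draft librar* "site search". That is, rows must contain "book", must not contain "draft", and rank higher if they contain a word starting with "librar" or the phrase "site search".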

Quote

Many people/resources seem to suggest not bothering with mysql for advanced full text search needs…

I think anyone that would say that as a general thing is not very familiar with what MySQL fulltext supports, or maybe is thinking of fulltext support where it was back a long time ago. For ProcessWire and the scale that most use it at, MySQL fulltext is really a sweet spot, enabling PW to deliver enormous power and capability when it comes to search features. 


You have just said a whole load of things I don't understand, even after the 20 years since I started developing. This is why I choose ProcessWire: even if I don't understand it now, I know it's more than likely covered for the future.


  • 1 month later...
On 7/3/2020 at 2:43 PM, ryan said:

Finally, searching for "books" does match "book" too, and a lot more.

I just started playing around with this and I thought I was doing something wrong, but based on the lemmas, "books" will match "source".  I thought I was going crazy because I was getting a ton of extra, unexpected results, but I suppose this definition makes that association:

// WireWordTools/lemmas/s.json
"sourcebooks": "sourcebook",

A heads up for anyone diving into Ryan's WireWordTools module.


  • 4 months later...

I had a request from a client to be able to handle spelling errors from users using their site search, so I've added various missplet words (mostly from https://en.wikipedia.org/wiki/Commonly_misspelled_English_words ) together with some American English spellings to the lemmas for the WireWordTools module.

Not 100% sure if this was the best way to do things or if there should be a different mechanism for handling alternative spellings but hey, it was easy to do and works well.

I've forked the module on GitHub if anyone wants the updated word lists

https://github.com/millipedia/WireWordTools

and there's a list of the words I added there as well if that's useful to @ryan or anyone else.
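For anyone curious what folding alternative spellings into the lemma lists amounts to, here is a minimal Python sketch. It assumes the lemma data is a plain mapping of word to lemma, like the JSON line quoted earlier in the thread. The function name merge_spellings and the sample words are made up for this example and are not taken from the fork.

```python
from typing import Dict


def merge_spellings(lemmas: Dict[str, str], variants: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of lemmas with spelling variants added as extra entries."""
    merged = dict(lemmas)
    for variant, correct in variants.items():
        variant, correct = variant.lower(), correct.lower()
        if variant == correct or variant in merged:
            continue  # never overwrite an existing dictionary entry
        # Point the variant at the lemma of the correct spelling when known,
        # otherwise at the correct spelling itself.
        merged[variant] = merged.get(correct, correct)
    return merged


# Hypothetical sample data: one misspelling and two American spellings.
lemmas = {"colours": "colour"}
variants = {"recieve": "receive", "color": "colour", "colors": "colours"}
merged = merged_lemmas = merge_spellings(lemmas, variants)
```

With this data, "recieve" maps to "receive", while "color" and "colors" both map to "colour". Existing entries are left untouched, so a merge like this can only add associations, not change ones that were already there.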

