Jump to content


Photo

Strange searching engine behaviours

search

  • Please log in to reply
6 replies to this topic

#1 Remi

Remi

    Distinguished Member

  • Members
  • PipPip
  • 24 posts
  • 6

  • LocationNorway / Poland

Posted 18 June 2012 - 04:59 PM

Hello again!

I'm still working with my PW powered site. I really love PW, but I've got problem with searching engine, while this is the most important part of my site (norwegian to polish dictionary)

My searching rules:
if (strlen($q)<=4) {
// My sollution for short words:
$matches = $pages->find("title|translation=$q, sort=title, limit=100");
// $matches = $pages->find("title|translation|definition%=$q, sort=title, sort=translation, limit=100");
// $matches = $pages->find("title|translation|definition^=$q, sort=title, sort=translation, limit=100");
}
else
{
  $matches = $pages->find("title|translation|definition~=$q, sort=title, sort=translation, limit=50");
}

Some examples:

Search word: sąd
Results:
domstol
sąd

frukthage en
sad

Search word: bąk
Results:
bak
za, z tyłu, w tyle

bak en
pośladek, tył

rørdrum en
bąk

So it looks that ą = a :/ Can someone tell me why?


Another problem:
When I'll change search querry to:
$matches = $pages->find("title|translation|definition%=$q, sort=title, sort=translation, limit=100");
and I'm searching for: dom ("house" in english and "hus" in norwegian)
I've got these results:
advisere
ogłosić, powiadom

aktivitetshus et
dom kultury

alderdom en
starość

aldershjem et
dom starców

ane
przeczuwać, domyślać się, podejrzewać
ane - aner - ante - ant

etc... while I don't get hus in the first 100 matches :/

When I'll change search querry to:
$matches = $pages->find("title|translation|definition%=$q, limit=100");

It's a little bit better, but still hus is very far from beginning of the search results.
How can I improve that? Is there any rule to sort results by match?

#2 Remi

Remi

    Distinguished Member

  • Members
  • PipPip
  • 24 posts
  • 6

  • LocationNorway / Poland

Posted 19 June 2012 - 03:59 AM

About first issue:
It looks that search engine use character replacement from Page Name module, so I know how to fix that.

But what about second issue?

#3 ryan

ryan

    Hero Member

  • Administrators
  • 5,780 posts
  • 3125

  • LocationAtlanta, GA

Posted 19 June 2012 - 11:21 AM

Actually I think that the first issue has to do with your MySQL collation. Right now I think it would be utf8_general_ci which sorts (and I think text indexes) as if it were all ASCII, sans accents. This is often desirable, but not in your case. You'd want to change the collation to something like utf8_unicode_ci or a more language-specific version if available (like utf8_swedish_ci). Btw, the "ci" part means "case insensitive". If you want case sensitivity, then you'd want the "cs" version, like utf8_unicode_cs.

To change the collation you'd want to edit your database in a tool like PhpMyAdmin. Find the tables for the fields in question: field_title, field_translation, field_definition. Edit the 'data' column of those tables. Change the collation from utf8_general_ci to utf8_unicode_ci. Rebuild the index for each by running a table repair. You can do this in PhpMyAdmin by editing the table, clicking the "operations" tab and clicking the "Repair" link at the bottom. Once you complete this, I think that you should get the behavior you wanted with regard to the matches and sorting.


Regarding your second question, I'm a little confused on this one because you are searching for "dom" but wanting it to match "hus"? Obviously that's not possible. But what I think you are wanting is for it to match "dom" by itself rather than in the middle of other words and such... ensuring that your "dom" definition (that contains "hus") comes up a lot sooner than it currently is? If so, you need to use the ~= operator in your search rather than %=. Because %= is a non-indexed (LIKE) search, it'll match terms in the middle of words. Whereas ~= will only match full words. However, ~= is an indexed search and MySQL fulltext indexes only work with words of 4 characters or more by default. But if you have control over your server environment, this is an easy fix to make it support fewer characters.

Btw, when you don't specify a "sort=" in your selector, and your selector searches text using either *= or ~=, the sort will be by relevance of the terms searched.

#4 Remi

Remi

    Distinguished Member

  • Members
  • PipPip
  • 24 posts
  • 6

  • LocationNorway / Poland

Posted 19 June 2012 - 01:57 PM

Regarding first issue - I will check that later.

About second issue
I've got page:
title: hus
translation (my field): dom

So, I want to find hus (page) by searching for: dom.

After all, I'll check everything later tonight.

#5 Remi

Remi

    Distinguished Member

  • Members
  • PipPip
  • 24 posts
  • 6

  • LocationNorway / Poland

Posted 22 June 2012 - 03:12 AM

I've done some research and I've got what I wanted.

First issue
It's not so important for me right now, but I will check that later.

Second issue
I've got results similar to my old CMS :)

I'm using shared hosting, so I can't modify MySQL configuration files. I have to use LIKE method, but with standard selector %= I had too many unwanted results.

I've decided to add extra selector %^= which use SQL LIKE method but gives me what I wanted :)

To do that I had to modify 2 files in ./wire/core folder:

DatabaseQuerySelectFulltext.php:
// At line 76 I've added:

case '%^=':
  $v = $this->db->escape_string($value);
  $v = preg_replace('/([%^_])/', '\\\$1', $v); // prep value for use in LIKE
  $query->where("$tableField LIKE '$v%'"); // SLOW, but assumed
  break;

Selector.php:
// At line 147 I've added:

  Selectors::addType(SelectorContainsLikeStarts::getOperator(), 'SelectorContainsLikeStarts');

// At line 226 I've added:

class SelectorContainsLikeStarts extends SelectorContains {
  public static function getOperator() { return '%^='; }
}

After that, I can search for short words using LIKE% method intead of %LIKE% which gives me less but highly wanted results :) It's similar to ^= selector.

I've proposed changes at GitHub.

Attached Files



#6 ryan

ryan

    Hero Member

  • Administrators
  • 5,780 posts
  • 3125

  • LocationAtlanta, GA

Posted 23 June 2012 - 01:22 PM

Thanks for the updates RTurala, this seems like a nice addition. I will plan to bring it into the the core soon.

#7 Remi

Remi

    Distinguished Member

  • Members
  • PipPip
  • 24 posts
  • 6

  • LocationNorway / Poland

Posted 23 June 2012 - 01:30 PM

You will find these changes pulled by me on github.

Another idea:
In some cases using %LIKE method can be useful for someone. So, there is room for the next selector (last possible to use with MySQL?).
I don't need that, but if You don't have enough time, I can make some changes in core files.





0 user(s) are reading this topic

0 members, 0 guests, 0 anonymous users