Site search finds html tags


Soma


Just wanted to share that when using $pages->find() to search title|headline|body, the search also matches HTML elements inside the text, since the body field usually contains HTML.

http://processwire.com/search/?q=strong

The pages found do not actually contain the word "strong" in their visible text.

I know that searching pages with $pages->find(selector) doesn't account for this, but maybe those tags should be cleaned out, or added to the stopwords? Perhaps by stripping tags before searching a textarea field?

Searching a site with PW in general

How do you guys feel about using $pages->find() as a site search? In cases where data from pages is pulled into other pages, it gets hard to work around those cases and keep track of them. Also, the search sometimes doesn't seem to return pages in the correct order of relevance, depending on factors like which fields you search and whether they're fulltext indexed. What is your experience with PW search? And if you use multi-language, the stopwords are still only the English ones that come with the core, so that's not ideal either.

What do you guys think about having a search tool that indexes pages using a parser and writes index tables? Or would you use another tool, or Google custom search?


Having written site search implementations in the past, it's easy to see the limitations in using the current selector-based search.

Just daydreaming, but one can envisage a site search module with configurable stopwords and such, extensible to include soundex and MySQL's SOUNDS LIKE, as well as fulltext search with query expansion, and maybe a bit of Levenshtein distance thrown in, all of which could help create a powerful site search.
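To make the daydream a bit more concrete: the MySQL features mentioned could look roughly like this against ProcessWire's field tables. This is a sketch only; the table/column names follow PW's field_* convention, and the query strings are assumptions, not tested code:

```php
// A fulltext-indexed text field "body" lives in table "field_body",
// column "data"; "pages_id" points back to the page. $db is PW's mysqli.

// Fulltext search with query expansion (also matches related terms):
$result = $db->query(
    "SELECT pages_id FROM field_body
     WHERE MATCH(data) AGAINST('search term' WITH QUERY EXPANSION)");

// Soundex matching via MySQL's SOUNDS LIKE (single words only):
$result = $db->query(
    "SELECT pages_id FROM field_title WHERE data SOUNDS LIKE 'procesware'");

// MySQL has no built-in Levenshtein, but PHP's levenshtein() could
// re-rank a small candidate set fetched first (smaller = closer match):
$distance = levenshtein('processwire', 'procesware');
```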

...just daydreaming, though...  :rolleyes:


> What do you guys think about having a search tool that indexes pages using a parser and writes index tables? Or would you use another tool, or Google custom search?

Lucene/Solr integration or a specific module would be ideal. If you are planning to make a module (are you? :)), then definitely go with a custom module, which would be much easier for 95% of people. Some kind of document/file search is a must for certain sites.


> Lucene/Solr integration or a specific module would be ideal. If you are planning to make a module (are you? :)), then definitely go with a custom module, which would be much easier for 95% of people. Some kind of document/file search is a must for certain sites.

I know virtually nothing about the inner workings of search engines, but wouldn't it be hard to make a module that comes close to the power and features of some proven technologies out there? Lucene-based solutions like Solr or ElasticSearch, or http://sphinxsearch.com/, seem hard to beat.

Of course, you won't be able to run those on relatively cheap hosting, but when a project requires this kind of search power, you're looking at some dedicated hosting anyway?


No, I'm not building anything; I just thought you could (:P).

Yeah, I agree it would be nice to have a custom module, but having built a search index myself in the past, I know how hard it is, especially regarding multi-language. So using third-party tools does make sense, if only to avoid reinventing the wheel.


Related to what I said before, a very simple module: it creates a new field body_notags (I'm taking body as an example, but it could have settings to choose the fields), and on save it passes the content of body through strip_tags() and stores it there. Then you just search on body_notags instead of body. Easy peasy.
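Diogo's idea could be sketched roughly like this as a save hook (e.g. in a small autoload module). The field name body_notags is assumed to already exist on the template as a plain textarea; treat this as an illustration, not a finished module:

```php
// Hypothetical sketch: copy a tag-stripped version of body into
// body_notags whenever a page is about to be saved.
wire()->addHookAfter('Pages::saveReady', function($event) {
    $page = $event->arguments(0);
    if(!$page->template->hasField('body_notags')) return;
    // pad tags with a space before stripping, so adjacent words don't merge
    $text = strip_tags(str_replace('<', ' <', $page->body));
    $page->body_notags = preg_replace('/\s+/', ' ', trim($text));
});
```

Searching would then become something like $pages->find("body_notags%=term").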


Thanks diogo, I know something like this is possible; it's just not a solution I like, and I don't think it's the way to go, especially on big sites. Also, just stripping tags would leave you with words glued together where there should usually be a whitespace.
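To illustrate the glued-words problem (plain PHP, nothing PW-specific):

```php
// strip_tags() alone merges words from adjacent elements:
echo strip_tags('<td>alpha</td><td>beta</td>');   // "alphabeta"

// padding the tags with a space first keeps the word boundary:
$html = '<td>alpha</td><td>beta</td>';
echo trim(preg_replace('/\s+/', ' ', strip_tags(str_replace('<', ' <', $html))));
// "alpha beta"
```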

After testing a little, I found the %= operator gives very strange results: it returns pages that don't contain the word at all. When doing a search in the MySQL admin with %word% on body, it returns only 2 pages, while $pages->find() returns 4. No idea what's going on, but there's definitely something wrong there.

Now, looking at the PW db query, the ~= operator would be the one to choose anyway, as it searches for multiple words and is the only operator that actually uses the stopwords.

So adding stopwords is easy, and I have now added some additional words for certain HTML tag and attribute names:

// extra stopwords for common HTML tag and attribute names
$stopwords = array("table", "tbody", "thead", "tfoot", "height", "strong", "align", "href", "style", "left");
foreach($stopwords as $w) DatabaseStopwords::add($w);

Soma, I don't know if the < and > characters can be included in the stopwords, but if so, the words could be "<table", "/table>", "<strong", "strong/>"...

You still have a problem with attributes, but those are not as easy to solve...


> Soma, I don't know if the < and > characters can be included in the stopwords, but if so, the words could be "<table", "/table>", "<strong", "strong/>"...
>
> You still have a problem with attributes, but those are not as easy to solve...

Why? You could add those, but they're not going to do anything, as you can't and don't search with < or > anyway. Adding the plain words works just fine.

BTW, when you search with "<keyword" you land in nowhere land, with a selector error thrown.

Ryan, bug? Even though I'm using $sanitizer->selectorValue("<keyword"), the < gets through and throws a fatal error.

Edit: Hmm, this happened on 2.2.9 installs; it seems to work correctly in later versions.


Oh, true... delete that. I was thinking in reverse, trying to avoid excluding a real "strong" in the text. It crossed my mind that it would work like: exclude "strong" if it has a < to its left... but that was dumb...

Ryan, would it make sense to implement a selector that uses REGEXP?


I don't really see HTML tags being indexed as a problem. I've taken advantage of that on a few occasions and was glad it was there. Honestly, I've never needed more than the built-in text searching capabilities, though my needs aren't everyone's, so I'm not saying there wouldn't be a need. There have been times when one operator or another better suited a particular client, whether %= or ~= or *=, but I've never had to use an external solution.

There have been one or two instances where I had a large quantity of fields that needed to be included in the search, and it became a potential bottleneck. This is part of the reason why FieldtypeCache exists: you can bundle all of your text fields into one, and then search the cache rather than the individual fields. So you would search for "field%=some text" rather than "field1|field2|field3|field4%=some text". It works quite well for this. It's been a while since I've needed it, but it's one of the original core Fieldtypes and worth looking at if you run into a similar issue of too many fields to search, or need those fields to be in the same block for ranking purposes. (Search ranking works a bit differently when the combination of fields is ranked together vs. each field ranked individually as separate fields.)
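In selector terms, Ryan's FieldtypeCache suggestion looks like this (the field name search_cache is a hypothetical example; you'd configure the cache field yourself to bundle whichever text fields you want searched):

```php
// Search one cache field instead of many individual fields:
$matches = $pages->find("search_cache%=some text");

// ...rather than:
$matches = $pages->find("title|headline|body%=some text");
```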

As for external search engines, I think it would be hard to beat something like Google CSE (if that's what they still call it). I've also used Sphider as a self-hosted PHP-based solution and was quite happy with it at the time… though this was before ProcessWire existed. But it still seems to be an active and (according to Google) highly rated PHP search engine. It does include the ability to index PDF, DOC and other files, though it requires external converters. If I recall, it works quite well for that, though.

