Site search finds html tags


Soma


Just wanted to share that when using $pages->find() to search title|headline|body, the search also matches HTML elements inside the text, since the body field usually contains HTML.

http://processwire.com/search/?q=strong

The pages found do not actually contain the word "strong" in their visible text.

I know that searching pages with $pages->find(selector) doesn't account for this, but maybe those tags should be cleaned out, or added to the stopwords? Perhaps by stripping tags before searching a textarea field?

Searching a site with PW in general

How do you guys feel about using $pages->find() as a site search? In cases where data from pages is pulled into other pages, it gets hard to work around those cases and keep track of them. Also, the search sometimes doesn't seem to return pages in the correct order of relevance, depending on factors like which fields you search and whether they're fulltext indexed. What is your experience with PW search? And if you use multi-language, the stopwords are still only the English ones that come with the core, so that's not ideal either.

What do you guys think about having a search tool that indexes pages using a parser and writes index tables? Or would you use another tool, or Google custom search?


Having written site search implementations in the past, it's easy to see the limitations in using the current selector-based search.

Just daydreaming, but one can envisage a site search module with configurable stopwords and such, extensible to include soundex and MySQL's SOUNDS LIKE, as well as fulltext search with query expansion, and maybe a bit of Levenshtein distance thrown in, all of which could help create a powerful site search.
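To make the daydream a bit more concrete: the MySQL features mentioned could look roughly like this against ProcessWire's field tables. This is a sketch only; the table/column names follow PW's field_* convention, and the query strings are assumptions, not tested code:

```php
// A fulltext-indexed text field "body" lives in table "field_body",
// column "data"; "pages_id" points back to the page. $db is PW's mysqli.

// Fulltext search with query expansion (also matches related terms):
$result = $db->query(
    "SELECT pages_id FROM field_body
     WHERE MATCH(data) AGAINST('search term' WITH QUERY EXPANSION)");

// Soundex matching via MySQL's SOUNDS LIKE (single words only):
$result = $db->query(
    "SELECT pages_id FROM field_title WHERE data SOUNDS LIKE 'procesware'");

// MySQL has no built-in Levenshtein, but PHP's levenshtein() could
// re-rank a small candidate set fetched first (smaller = closer match):
$distance = levenshtein('processwire', 'procesware');
```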

...just daydreaming, though...  :rolleyes:


> What do you guys think about having a search tool that indexes pages using a parser and writes index tables? Or would you use another tool, or Google custom search?

Lucene/Solr integration or a specific module would be ideal. If you are planning to make a module (are you? :)), then definitely go with a custom module, which would be much easier for 95% of people. Some kind of document/file search is a must for certain sites.


> Lucene/Solr integration or a specific module would be ideal. If you are planning to make a module (are you? :)), then definitely go with a custom module, which would be much easier for 95% of people. Some kind of document/file search is a must for certain sites.

I know virtually nothing about the inner workings of search engines, but wouldn't it be hard to make a module that comes close to the power and features of some proven technologies out there? Lucene-based solutions like Solr or ElasticSearch, or http://sphinxsearch.com/, seem hard to beat.

Of course, you won't be able to run those on relatively cheap hosting, but when a project requires this kind of search power, you're looking at some dedicated hosting anyway?


No, I'm not building anything; I just thought you could (:P).

Yeah, I agree it would be nice to have a custom module, but having built a search index myself in the past, I know how hard it is, especially regarding multi-language. So using third-party tools does make sense, if only to avoid reinventing the wheel.


Related to what I said before, a very simple module: it creates a new field body_notags (I'm taking body as an example, but it could have settings to choose the fields), and on save it passes the content of body through strip_tags() and stores it there. Then you just search on body_notags instead of body. Easy peasy.
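Diogo's idea could be sketched roughly like this as a save hook (e.g. in a small autoload module). The field name body_notags is assumed to already exist on the template as a plain textarea; treat this as an illustration, not a finished module:

```php
// Hypothetical sketch: copy a tag-stripped version of body into
// body_notags whenever a page is about to be saved.
wire()->addHookAfter('Pages::saveReady', function($event) {
    $page = $event->arguments(0);
    if(!$page->template->hasField('body_notags')) return;
    // pad tags with a space before stripping, so adjacent words don't merge
    $text = strip_tags(str_replace('<', ' <', $page->body));
    $page->body_notags = preg_replace('/\s+/', ' ', trim($text));
});
```

Searching would then become something like $pages->find("body_notags%=term").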


Thanks diogo, I know something like this is possible; it's just not a solution I like, and I don't think it's the way to go, especially on big sites. Also, just stripping tags would leave you with words glued together where there should usually be a whitespace.
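To illustrate the glued-words problem (plain PHP, nothing PW-specific):

```php
// strip_tags() alone merges words from adjacent elements:
echo strip_tags('<td>alpha</td><td>beta</td>');   // "alphabeta"

// padding the tags with a space first keeps the word boundary:
$html = '<td>alpha</td><td>beta</td>';
echo trim(preg_replace('/\s+/', ' ', strip_tags(str_replace('<', ' <', $html))));
// "alpha beta"
```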

After testing a little, I found the %= operator gives very strange results: it returns pages that don't contain the word at all. When doing a search in the MySQL admin with %word% on body, it returns only 2 pages, while $pages->find() returns 4. No idea what's going on, but there's definitely something wrong there.

Now, looking at the PW db query, the ~= operator would be the one to choose anyway, as it searches for multiple words and is the only operator that actually uses the stopwords.

So adding stopwords is easy, and I have now added some additional words for certain HTML tag and attribute names:

// extra stopwords for common HTML tag and attribute names
$stopwords = array("table", "tbody", "thead", "tfoot", "height", "strong", "align", "href", "style", "left");
foreach($stopwords as $w) DatabaseStopwords::add($w);

Soma, I don't know if the < and > characters can be included in the stopwords, but if so, the words could be "<table", "/table>", "<strong", "strong/>"...

You still have a problem with attributes, but those are not as easy to solve...


> Soma, I don't know if the < and > characters can be included in the stopwords, but if so, the words could be "<table", "/table>", "<strong", "strong/>"...
>
> You still have a problem with attributes, but those are not as easy to solve...

Why? You could add those, but they're not going to do anything, as you can't and don't search with < or > anyway. Adding the plain words works just fine.

BTW, when you search with "<keyword" you land in nowhere land, with a selector error thrown.

Ryan, bug? Even though I'm using $sanitizer->selectorValue("<keyword"), the < gets through and throws a fatal error.

Edit: Hmm, this happened on 2.2.9 installs; it seems to work correctly in later versions.


Oh, true... delete that. I was thinking in reverse, trying to avoid excluding a real "strong" in the text. It crossed my mind that it would work like: exclude "strong" if it has a < to its left... but that was dumb...

Ryan, would it make sense to implement a selector that uses REGEXP?


I don't really see HTML tags being indexed as a problem. I've taken advantage of that on a few occasions and was glad it was there. Honestly, I've never needed more than the built-in text searching capabilities, though my needs aren't everyone's, so I'm not saying there wouldn't be a need. There have been times when one operator or another better suited a particular client, whether %= or ~= or *=, but I've never had to use an external solution.

There have been one or two instances where I had a large quantity of fields that needed to be included in the search, and it became a potential bottleneck. This is part of the reason why FieldtypeCache exists: you can bundle all of your text fields into one, and then search the cache rather than the individual fields. So you would search for "field%=some text" rather than "field1|field2|field3|field4%=some text". It works quite well for this. It's been a while since I've needed it, but it's one of the original core Fieldtypes and worth looking at if you run into a similar issue of too many fields to search, or need those fields to be in the same block for ranking purposes. (Search ranking works a bit differently when the combination of fields is ranked together vs. each field ranked individually as separate fields.)
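In selector terms, Ryan's FieldtypeCache suggestion looks like this (the field name search_cache is a hypothetical example; you'd configure the cache field yourself to bundle whichever text fields you want searched):

```php
// Search one cache field instead of many individual fields:
$matches = $pages->find("search_cache%=some text");

// ...rather than:
$matches = $pages->find("title|headline|body%=some text");
```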

As for external search engines, I think it would be hard to beat something like Google CSE (if that's what they still call it). I've also used Sphider as a self-hosted PHP-based solution and was quite happy with it at the time… though this was before ProcessWire existed. But it still seems to be an active and (according to Google) highly rated PHP search engine. It does include the ability to index PDF, DOC and other files, though it requires external converters. If I recall, it works quite well for that, though.

