
Module: Site indexer



This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.

You are right. But there is this option:

[Attached screenshot: post-834-0-73521500-1374504339_thumb.png]

It is not exactly what you described, but a word longer than 3-4 characters can usually be considered a real keyword.



You probably mean stop words? Seems like a reasonable feature to me, especially if made configurable (as English stop words make very little sense / are sometimes even harmful for a site written in Finnish etc.)
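
A configurable stop word filter could look roughly like the sketch below. The function names `lower()` and `removeStopWords()` and the sample word list are hypothetical and not part of the Indexer module; the point is only that the list is passed in by the caller, so a Finnish site can supply its own words instead of English ones.

```php
<?php
// Hypothetical helpers, not part of the Indexer module.

// Lowercase a word, using mbstring when available so non-ASCII letters work.
function lower(string $word): string
{
    return function_exists('mb_strtolower') ? mb_strtolower($word) : strtolower($word);
}

// Remove stop words from a keyword list using a caller-supplied list,
// so each site (or language) can configure its own.
function removeStopWords(array $keywords, array $stopWords): array
{
    // Lowercase the stop words once and flip them into a lookup table.
    $lookup = array_flip(array_map('lower', $stopWords));

    $kept = array_filter($keywords, function ($word) use ($lookup) {
        return !isset($lookup[lower((string) $word)]);
    });

    // Reindex so the result is a plain list.
    return array_values($kept);
}

// Example word list (an assumption; a real list would be much longer).
$english = ['a', 'the', 'of'];
$result = removeStopWords(['The', 'history', 'of', 'Finland'], $english);
// $result is ['history', 'Finland']
```

Matching is case-insensitive here, which is a design choice of the sketch rather than something the module is known to do.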

Just saying.


_('text') is for gettext, but ProcessWire doesn't use gettext. I think that _() was actually meant to be __('text') or $this->_('text'). Those are among the ProcessWire translation functions.


Ok, I've switched from _() to __()!

If the call is within a class that extends one of ProcessWire's (like Wire or WireData), it's actually better to use $this->_('your text'); as there is a little bit less overhead with that call than with a __('your text'); call. 

  • 1 month later...

I had to comment out a line in the Indexer.module file (v0.5.1, line 246, getKeywords()) that strips numbers from indexed text. I had to do this because my site contains product names that include numbers (e.g. "serviceFLAT360"), and those weren't being found.

Perhaps you could introduce a config option for the module to enable/disable number stripping.
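
As a rough illustration of what such a switch might do, here is a minimal standalone sketch. The function `extractKeywords()` and its `$stripNumbers` parameter are hypothetical; the module's real getKeywords() is not shown in this thread and may work differently.

```php
<?php
// Hypothetical sketch; the module's real getKeywords() may differ.
function extractKeywords(string $text, bool $stripNumbers = true): array
{
    if ($stripNumbers) {
        // Remove all digits before splitting the text into words.
        $text = (string) preg_replace('/\d+/', '', $text);
    }

    // Split on anything that is not a letter or a digit (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Drop duplicates and reindex.
    return array_values(array_unique($words));
}

// With stripping on, "serviceFLAT360" is indexed as "serviceFLAT".
$stripped = extractKeywords('serviceFLAT360 costs 20 euros');
// With stripping off, the product name survives intact.
$kept = extractKeywords('serviceFLAT360 costs 20 euros', false);
```

With the flag exposed as a module setting, sites with numeric product names could turn stripping off without editing the module file.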


  • 1 month later...

Hey Alessio,

Love the module, works great.

I missed one thing though... Page fields.

I regularly use Page fields to reference things like Genres, Categories, Countries etc.

So I added some code that also indexes the page names of the pages referenced in Page fields.

I created a pull request on GitHub to add this change to your code.

For those wanting to try this out, replace the extractTextFromField function in Indexer.module with this one or just add the elseif() part (start line 372) to it:

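     public function extractTextFromField($f, $p){
        if( preg_match('/text|title|url/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            // Text-like fields: index the value with HTML tags removed
            $stripped = strip_tags($p->get($f->name));
            return ' '.$stripped;
        elseif( preg_match('/page/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            // Page reference fields: index the names of the referenced pages
            $stripped = "";
            $f_ref = $p->get($f->name);
            if($f_ref instanceof PageArray){
                // Multiple page references
                foreach($f_ref as $fp){
                    $stripped .= ' '.strip_tags($fp->name);
                }
            }elseif($f_ref instanceof Page && $f_ref->id){
                // Single page reference (skipped when empty)
                $stripped .= ' '.strip_tags($f_ref->name);
            }
            return $stripped;
        endif;

        // Any other field type contributes nothing to the index
        return '';
     }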

  • 3 weeks later...
  • 3 months later...

Hi Alessio,

I am currently evaluating the module for a project. This project is multilingual.

My questions:

Right now the stop words list is hardcoded in the module's folder. If I add or change anything there, it would be lost after an update of the module. Correct?

Wouldn't the assets folder be a better place for the stop words? Or the database?

What do I do in a multilingual environment? How can I set stop words per language? Can I at all?
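
To make the idea concrete, here is a minimal sketch of what I have in mind: one plain-text file per language, kept in a directory outside the module folder so that a module update doesn't overwrite it. Everything in it is an assumption for illustration: the function name `loadStopWords()`, the directory path, and the file names (fi.txt, default.txt) are hypothetical and not something the module currently provides.

```php
<?php
// Hypothetical sketch: load one stop word file per language from a directory
// outside the module folder, so edits survive module updates.
function loadStopWords(string $dir, string $language, string $fallback = 'default'): array
{
    foreach ([$language, $fallback] as $name) {
        // basename() keeps the language name from escaping the directory.
        $file = rtrim($dir, '/') . '/' . basename($name) . '.txt';
        if (is_file($file)) {
            // One word per line; drop line endings and skip blank lines.
            $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            return $lines === false ? [] : array_map('trim', $lines);
        }
    }

    // No list found for this language or the fallback: filter nothing.
    return [];
}

// Example call with an assumed location under the site's assets folder.
$stopWords = loadStopWords('/path/to/site/assets/stopwords', 'fi');
```

The fallback file is there so that languages without their own list still get some filtering; returning an empty list would work as well.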
