
Module: Site indexer



This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.

You are right. But there is this option:

[Screenshot of the module's configuration option]

It is not exactly what you described, but a word with more than 3-4 characters can usually be considered a real keyword.


This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.

You probably mean stop words? Seems like a reasonable feature to me, especially if made configurable (as English stop words make very little sense / are sometimes even harmful for a site written in Finnish etc.)

Just saying.
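
For what it's worth, a configurable stop-word filter could be sketched roughly like this. The function name `filterKeywords`, its parameters and the tiny word list are assumptions for illustration only, not code from the module; the minimum-length check mirrors the option mentioned earlier in the thread.

```php
<?php
/**
 * Hypothetical helper (not part of Indexer.module): drops stop words and
 * very short words from a list of candidate keywords.
 *
 * @param string[] $words     candidate keywords
 * @param string[] $stopWords stop words for the site's language (assumed configurable)
 * @param int      $minLength words shorter than this are discarded
 * @return string[] lowercased keywords that survived both checks
 */
function filterKeywords(array $words, array $stopWords, int $minLength = 3): array
{
    // Prefer multibyte-safe functions (needed for e.g. Finnish), if available.
    $lower  = function_exists('mb_strtolower') ? 'mb_strtolower' : 'strtolower';
    $length = function_exists('mb_strlen') ? 'mb_strlen' : 'strlen';

    // Lowercase the stop words once and use them as array keys for fast lookup.
    $lookup = array_flip(array_map($lower, $stopWords));

    $kept = [];
    foreach ($words as $word) {
        $normalized = $lower(trim($word));
        if ($length($normalized) < $minLength) {
            continue; // too short to be a useful keyword
        }
        if (isset($lookup[$normalized])) {
            continue; // stop word for the configured language
        }
        $kept[] = $normalized;
    }
    return $kept;
}

$english = ['a', 'the', 'of', 'and'];
$result  = filterKeywords(['The', 'history', 'of', 'Finland', 'and', 'it'], $english);
// $result now holds ['history', 'finland']
```

Passing the stop-word list in as a parameter keeps the language choice outside the filter itself, so a Finnish site could simply supply a different list.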


_('text') is for gettext, but ProcessWire doesn't use gettext. I think _() was actually meant to be __('text') or $this->_('text'). Those are among ProcessWire's translation functions.


OK, I've switched from _() to __()!

 
 
If the call is within a class that extends one of ProcessWire's (like Wire or WireData), it's actually better to use $this->_('your text'); as there is a little bit less overhead with that call than with a __('your text'); call. 

  • 1 month later...

I had to comment out a line in the Indexer.module file (v0.5.1, line 246, in getKeywords()) that strips numbers from the indexed text, because my site contains product names with numbers in them (e.g. "serviceFLAT360") that weren't being found.

Perhaps you could introduce a config option for the module to enable/disable number stripping.
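
To illustrate what I mean, here is a minimal sketch. I don't know exactly how getKeywords() strips numbers, so the function name `prepareIndexText` and the regular expression are assumptions; `$stripNumbers` stands in for the proposed config option.

```php
<?php
/**
 * Hypothetical sketch: the real getKeywords() in Indexer.module may strip
 * numbers differently. $stripNumbers represents a module config option.
 */
function prepareIndexText(string $text, bool $stripNumbers = true): string
{
    if ($stripNumbers) {
        // Replace every run of digits with a space.
        $text = preg_replace('/\d+/', ' ', $text);
    }
    // Collapse the whitespace left behind by the replacement.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$stripped = prepareIndexText('Order serviceFLAT360 today', true);
$kept     = prepareIndexText('Order serviceFLAT360 today', false);
// With stripping enabled the product name loses its "360";
// with stripping disabled it is indexed unchanged.
```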


  • 1 month later...

Hey Alessio,

Love the module, works great.

I missed one thing though... Page fields.

I use Page fields regularly to make for example references to Genres, Categories, Countries etc.

So I added some code to also add the pagenames of the pages in Page fields.

I created a pull request on GitHub to add this change to your code.

For those wanting to try this out, replace the extractTextFromField function in Indexer.module with this one or just add the elseif() part (start line 372) to it:

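     public function extractTextFromField($f, $p){
        if( preg_match('/text|title|url/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            // Text-like fields: index their content without markup.
            $stripped = strip_tags((string) $p->get($f->name));
            return ' '.$stripped;
        elseif( preg_match('/page/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            // Page reference fields: index the names of the referenced pages.
            $stripped = "";
            $f_ref = $p->get($f->name);
            if($f_ref instanceof PageArray){
                foreach($f_ref as $fp){
                    $stripped .= ' '.strip_tags($fp->name);
                }
            }elseif($f_ref instanceof Page && $f_ref->id){
                // Single reference; an empty field yields no page (or a NullPage with id 0).
                $stripped .= ' '.strip_tags($f_ref->name);
            }
            return $stripped;
        endif;

        // Any other field type contributes no text.
        return '';
     }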

  • 3 weeks later...
  • 3 months later...

Hi Alessio,

I am currently evaluating the module for a project. The project is multilingual.

My questions:

Right now the stopwords list is hardcoded in the module's folder. If I add or change anything there, it will be lost after an update of the module. Correct?

Isn't assets a better place for the stopwords? - Or the database?

What do I do if I am in a multilingual environment? How can I set stopwords per language? Can I at all?
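
To make the idea concrete, this is roughly what I have in mind. It is only a sketch: the function name `loadStopWords`, the directory and the file naming scheme (stopwords-<language>.txt) are my own assumptions, not something the module supports today. In ProcessWire the language name would presumably come from something like `$user->language->name`.

```php
<?php
/**
 * Hypothetical sketch: loads a per-language stop-word list from a directory
 * outside the module folder, so a module update cannot overwrite it.
 * The file naming scheme (stopwords-<language>.txt) is an assumption.
 */
function loadStopWords(string $dir, string $languageName): array
{
    // Allow only simple language names to rule out path traversal.
    if (!preg_match('/^[a-z0-9_-]+$/i', $languageName)) {
        return [];
    }
    $file = rtrim($dir, '/') . '/stopwords-' . $languageName . '.txt';
    if (!is_readable($file)) {
        return []; // no list for this language: nothing gets filtered
    }
    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return [];
    }
    // One word per line; drop blank entries and duplicates.
    $words = array_filter(array_map('trim', $lines), 'strlen');
    return array_values(array_unique($words));
}

// Demo with a temporary directory standing in for a folder under site/assets/.
$dir = sys_get_temp_dir() . '/indexer-stopwords-' . uniqid();
mkdir($dir);
file_put_contents($dir . '/stopwords-fi.txt', "ja\nei\nettä\n\nja\n");

$finnish = loadStopWords($dir, 'fi'); // words from the Finnish file
$german  = loadStopWords($dir, 'de'); // no file, so an empty list
```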


  • 7 months later...
  • 10 years later...

The Site Indexer module has been in use on one of our websites for quite a while. It worked very well for us for extracting text from PDFs.

I just started to review the site in order to make it work with PHP 8.1. It turned out that some adjustments would be necessary at a few places, including the Indexer module.

My PHP skills are quite limited, but the module author @Alessio Dal Bianco was kind enough to give me a helping hand (thanks Alessio!).

The additions from @DaveP's fork have also been incorporated.

The result can be accessed on GitHub (v0.8.3 branch):

https://github.com/USSliberty/Processwire-site-indexer/tree/v0.8.3

Happy Wireing!


Hello to everyone!

As @HerTha wrote, I picked up my very old module again to make it compatible with PHP 8.1, and I've done the bare minimum (for now) to make it work. I haven't created an official release yet because, in order not to break things for current installations, I still need to adapt and test the new PHP class that @DaveP introduced in his own fork:

LukeMadhanga\Pdf2Text::pdf2txt($filefullpath);

I suspect it was already included in his Composer setup or something similar.
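
One way to keep existing installations working could be to guard the call, roughly like this. It is only a sketch: the wrapper name `extractPdfText` and the null return value are assumptions, and only the `LukeMadhanga\Pdf2Text::pdf2txt()` call itself comes from the fork.

```php
<?php
/**
 * Hypothetical wrapper: only calls the Pdf2Text class if it can be loaded,
 * so installations without it do not hit a fatal error. Returns null when
 * the class is unavailable, leaving the fallback to the caller.
 */
function extractPdfText(string $fileFullPath): ?string
{
    $class = 'LukeMadhanga\\Pdf2Text';
    if (!class_exists($class) || !is_callable([$class, 'pdf2txt'])) {
        return null; // class not installed: let the caller decide what to do
    }
    return (string) $class::pdf2txt($fileFullPath);
}

// The path is a placeholder; without the class installed this yields null.
$text = extractPdfText('/path/to/file.pdf');
```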

As you know, the module is quite old. I tested it with the latest version of ProcessWire a few days ago and it seems to work fine overall. If you have any suggestions about the module structure, let me know here: https://github.com/USSliberty/Processwire-site-indexer/pull/4

I am curious whether there are any metrics on how many installations or downloads my module has. I've seen many (better) alternatives pop up over the years, so I would like to know whether any of you still need support for it in terms of features and/or bug fixes, or whether I can just release this minor version and mark the module as deprecated or something similar.

 

Alessio

 

