Module: Site indexer

Alessio Dal Bianco · July 22, 2013

On 7/22/2013 at 2:40 PM, SteveB said:
This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.

You are right. But there is this option:

It is not exactly what you described, but usually a word with more than 3-4 characters can be considered a real keyword.

teppo · July 22, 2013

On 7/22/2013 at 2:40 PM, SteveB said:
This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.

You probably mean stop words? Seems like a reasonable feature to me, especially if made configurable (as English stop words make very little sense / are sometimes even harmful for a site written in Finnish etc.)

Just saying.
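As a rough illustration of the two ideas discussed above (a minimum word length and a configurable stop-word list), here is a minimal sketch. The function name `filterKeywords` and its parameters are hypothetical and are not part of the module; the sketch also assumes the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper, not part of the Indexer module.
 * Keeps words that are at least $minLength characters long
 * and that do not appear in the supplied stop-word list.
 */
function filterKeywords(string $text, array $stopWords = [], int $minLength = 3): array
{
    // Lowercase, then split on anything that is not a letter or digit (Unicode-aware).
    $words = preg_split('/[^\p{L}\p{N}]+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY) ?: [];

    // Flip the stop-word list so lookups are done by array key.
    $stop = array_flip(array_map('mb_strtolower', $stopWords));

    $keep = [];
    foreach ($words as $word) {
        if (mb_strlen($word) >= $minLength && !isset($stop[$word])) {
            $keep[] = $word;
        }
    }

    // Remove duplicates while preserving first-seen order.
    return array_values(array_unique($keep));
}
```

Because the list is passed in as an argument, a Finnish site could supply its own words (or none at all) instead of an English list.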

ryan · July 23, 2013

_('text') is for gettext, but ProcessWire doesn't use gettext. I think that _() was actually meant to be a __('text') or a $this->_('text') ? Those are the ProcessWire translation functions, among others.

Alessio Dal Bianco · July 23, 2013

On 7/23/2013 at 10:50 AM, ryan said:
_('text') is for gettext, but ProcessWire doesn't use gettext. I think that _() was actually meant to be a __('text') or a $this->_('text') ? Those are the ProcessWire translation functions, among others.

OK, I've switched from _() to __()!

ryan · July 25, 2013

Quote

OK, I've switched from _() to __()!

If the call is within a class that extends one of ProcessWire's (like Wire or WireData), it's actually better to use $this->_('your text'); as there is a little bit less overhead with that call than with a __('your text'); call.

marco · September 20, 2013

I had to comment out a line in the Indexer.module file (v0.5.1, line 246, getKeywords()) that strips numbers from indexed text. I had to do this because my site contains product names that include numbers (e.g. "serviceFLAT360"), and these weren't being found.

Perhaps you could introduce a config option for the module to enable/disable number stripping.
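A minimal sketch of what such an option could look like. This is not the module's actual getKeywords(); the function name `extractKeywords` and the `$stripNumbers` flag (standing in for a module config checkbox) are assumptions made for illustration.

```php
<?php
/**
 * Hypothetical sketch, not the module's real getKeywords().
 * $stripNumbers stands in for a module configuration setting.
 */
function extractKeywords(string $text, bool $stripNumbers = true): array
{
    if ($stripNumbers) {
        // Replace runs of digits with a space only when the option is enabled.
        $text = preg_replace('/\d+/', ' ', $text);
    }

    // Split on anything that is not a letter or digit, so "serviceFLAT360" stays one token.
    return preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}
```

With the flag set to false, a product name such as "serviceFLAT360" is kept intact as a keyword; with it set to true, only "serviceFLAT" remains.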

Alessio Dal Bianco · September 20, 2013

Hi marco, I'm working on a new version because I'm facing the same problem. It will be released soon!

marco · September 20, 2013

Great! I'm looking forward to it.

Alessio Dal Bianco · September 26, 2013

Hi all,

New changes / features under the hood, check it out!

http://modules.processwire.com/modules/indexer/

Jeroen Diderik · November 24, 2013

Hey Alessio,

Love the module, works great.

I missed one thing though... Page fields.

I use Page fields regularly to make for example references to Genres, Categories, Countries etc.

So I added some code to also add the pagenames of the pages in Page fields.

I created a pull request on GitHub to add this change to your code.

For those wanting to try this out, replace the extractTextFromField function in Indexer.module with this one or just add the elseif() part (start line 372) to it:

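     public function extractTextFromField($f, $p){
        // Text-like fields (text, textarea, title, URL): index the tag-stripped value.
        if( preg_match('/text|title|url/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            $stripped = strip_tags($p->get($f->name));
            return ' '.$stripped;
        // Page reference fields: index the names of the referenced pages.
        elseif( preg_match('/page/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            $stripped = "";
            $f_ref = $p->get($f->name);
            if($f_ref instanceof PageArray){
                // Multiple references: append each page name.
                foreach($f_ref as $fp){
                    $stripped .= ' '.strip_tags($fp->name);
                }
            }elseif($f_ref instanceof Page){
                // Single reference; an empty field may not return a Page, so guard against it.
                $stripped .= ' '.strip_tags($f_ref->name);
            }
            return $stripped;
        endif;
        // Any other field type contributes no text.
        return '';
     }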

Alessio Dal Bianco · November 26, 2013

Hi jdiderick,

Thank you for the addition! I will release a new version soon with your code and some other improvements.

ADB

Alessio Dal Bianco · December 17, 2013

Hi all,

I have updated the module.

Some fixes for repeater fields, plus Diderik's addition (thank you!)

ADB

ceberlin · March 30, 2014

Hi Alessio,

I am right now evaluating the module for a project. This project has languages.

My questions:

Right now the stopwords list is hardcoded into the module's folder. If I add or change anything there, it would be lost after an update of the module. Correct?

Isn't assets a better place for the stopwords? - Or the database?

What do I do if I am in a multilingual environment? How can I set stopwords per language? Can I at all?
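To make the question concrete, here is a minimal sketch of how per-language stop-word lists could be kept outside the module folder. It is not something the module does today: the function name `loadStopWords`, the file naming scheme (`stopwords-<language>.txt`, one word per line) and the directory location are all assumptions.

```php
<?php
/**
 * Hypothetical sketch: load a stop-word list for one language from a
 * directory outside the module folder, e.g. <dir>/stopwords-fi.txt.
 */
function loadStopWords(string $dir, string $language): array
{
    // basename() keeps the language name from escaping the directory.
    $file = rtrim($dir, '/') . '/stopwords-' . basename($language) . '.txt';

    if (!is_file($file)) {
        return []; // no list for this language, so nothing is filtered
    }

    $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [];

    // Trim whitespace and drop lines that end up empty.
    return array_values(array_filter(array_map('trim', $lines), 'strlen'));
}
```

Files stored this way would survive a module update, and a language without a list simply gets no stop-word filtering.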

thetuningspoon · November 7, 2014

Hi Alessio, thank you for your work on this module. It looks like it's just what I need on an upcoming project, and accomplishes it in the same way I was contemplating doing it.

One question: Will this index Excel files as well as PDFs and Word files?

HerTha · January 17

The Site Indexer module has been in use on one of our websites for quite a while. It worked very well for us for extracting text from PDFs.

I just started to review the site in order to make it work with PHP 8.1. It turned out that some adjustments would be necessary at a few places, including the Indexer module.

My PHP skills are quite limited, but the module author @Alessio Dal Bianco was kind enough to give me a helping hand (thanks Alessio!).

The additions from @DaveP's fork have also been incorporated.

The result can be accessed on GitHub (v0.8.3 branch):

https://github.com/USSliberty/Processwire-site-indexer/tree/v0.8.3

Happy Wireing!

Alessio Dal Bianco · January 18

Hello to everyone!

As @HerTha wrote, I resumed work on my very old module to make it compatible with PHP 8.1, and I've done the bare minimum (for now) to make it work. I have not created an official release yet because:

To avoid breaking current installations, I still need to adapt and test the new PHP class that @DaveP introduced in his own fork:

LukeMadhanga\Pdf2Text::pdf2txt($filefullpath);

I suspect it was already included through his Composer setup or something.

The module, as you know, is quite old. I tested it with the latest version of ProcessWire a few days ago and it seems to work fine overall. If you have any suggestions on the module structure, let me know here: https://github.com/USSliberty/Processwire-site-indexer/pull/4

I am curious whether there are any metrics on how many installations/downloads my module has. I have seen that many (better) alternatives have popped up over the years. In the end, I would like to know whether any of you still need support for it in terms of features and/or bugs, or whether I can just release this minor version and mark the module as deprecated or something else.

Alessio
