Alessio Dal Bianco (Author) Posted July 22, 2013
Quote: This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.
You are right. But there is this option: it is not exactly what you explained, but usually a word with more than 3-4 characters can be considered a real keyword.
teppo Posted July 22, 2013
Quote: This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.
You probably mean stop words? Seems like a reasonable feature to me, especially if made configurable (English stop words make very little sense, and are sometimes even harmful, for a site written in Finnish, etc.). Just saying.
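To make the two ideas above concrete (a minimum word length and a configurable stop-word list), here is a minimal sketch. The function name `filterKeywords` and its parameters are hypothetical and are not taken from the Indexer module; it only illustrates how the two filters could be combined.

```php
<?php
// Hypothetical helper, not part of the Indexer module.
// Keeps only words that are long enough and not in the stop-word list.
function filterKeywords(array $words, array $stopWords, int $minLength = 4): array
{
    // Lowercase the stop words once and flip them into keys for fast lookup.
    // Note: strtolower()/strlen() are byte-based; a site with non-ASCII
    // text (e.g. Finnish) would likely want the mb_* equivalents instead.
    $stop = array_flip(array_map('strtolower', $stopWords));

    $kept = [];
    foreach ($words as $word) {
        $lower = strtolower($word);
        if (strlen($lower) < $minLength) {
            continue; // too short to be treated as a keyword
        }
        if (isset($stop[$lower])) {
            continue; // listed as a stop word for this site
        }
        $kept[] = $lower;
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($kept));
}
```

Because the stop-word list is passed in as an argument, a Finnish site could supply its own list (or an empty one) instead of an English default.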
ryan Posted July 23, 2013
_('text') is for gettext, but ProcessWire doesn't use gettext. I think that _() was actually meant to be __('text') or $this->_('text')? Those are the ProcessWire translation functions, among others.
Alessio Dal Bianco (Author) Posted July 23, 2013
Quote: _('text') is for gettext, but ProcessWire doesn't use gettext. I think that _() was actually meant to be __('text') or $this->_('text')? Those are the ProcessWire translation functions, among others.
OK, I've switched from _() to __()!
ryan Posted July 25, 2013
Quote: OK, I've switched from _() to __()!
If the call is within a class that extends one of ProcessWire's (like Wire or WireData), it's actually better to use $this->_('your text'); as there is a little less overhead with that call than with a __('your text'); call.
marco Posted September 20, 2013
I had to comment out a line in the Indexer.module file (v0.5.1, line 246, getKeywords()) that strips numbers from indexed text. I had to do this because my site contains product names that include numbers (e.g. "serviceFLAT360"), and these weren't being found. Perhaps you could introduce a config option for the module to enable/disable number stripping.
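The requested option could look roughly like the sketch below. This is not the module's getKeywords() code, which is not shown in this thread; the function `tokenize` and the `$stripNumbers` flag are hypothetical names used only to show how a switch would keep a term like "serviceFLAT360" searchable.

```php
<?php
// Hypothetical sketch, not the module's actual getKeywords() implementation.
// Splits text into lowercase tokens, optionally removing digits first.
function tokenize(string $text, bool $stripNumbers = true): array
{
    if ($stripNumbers) {
        // Remove every run of digits, so "serviceFLAT360" becomes "serviceFLAT".
        $text = preg_replace('/\d+/', '', $text);
    }

    // Split on anything that is not a Unicode letter or digit.
    // preg_split() returns false on error, so fall back to an empty array.
    $tokens = preg_split('/[^\p{L}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) ?: [];

    return array_map('strtolower', $tokens);
}
```

With `$stripNumbers` set to false the product name is indexed whole; with the default it loses its numeric suffix, which matches the behaviour described in the post.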
Alessio Dal Bianco (Author) Posted September 20, 2013
Hi marco, I'm working on a new version because I'm facing the same problem. It will be released soon!
marco Posted September 20, 2013
Great! I'm looking forward to it.
Alessio Dal Bianco (Author) Posted September 26, 2013
Hi all, new changes / features under the hood, check it out! http://modules.processwire.com/modules/indexer/
Jeroen Diderik Posted November 24, 2013
Hey Alessio, love the module, works great. I missed one thing though: Page fields. I use Page fields regularly to make, for example, references to Genres, Categories, Countries, etc. So I added some code to also add the page names of the pages in Page fields. I created a pull request on GitHub to add this change to your code. For those wanting to try this out, replace the extractTextFromField function in Indexer.module with this one, or just add the elseif() part (start line 372) to it:

    public function extractTextFromField($f, $p){
        if( preg_match('/text|title|url/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            // Text-like fields: index their content without HTML tags.
            $stripped = strip_tags($p->get($f->name));
            return ' '.$stripped;
        elseif( preg_match('/page/i', $f->type) && $p->editable($f->name) && $f->name != self::fieldName ):
            // Page reference fields: index the names of the referenced pages.
            $stripped = "";
            $f_ref = $p->get($f->name);
            if($f_ref instanceof PageArray){
                // Multiple references: add each page name.
                foreach($f_ref as $fp){
                    $stripped .= ' '.strip_tags($fp->name);
                }
            }elseif($f_ref instanceof Page){
                // Single reference: skip empty values that are not a Page.
                $stripped .= ' '.strip_tags($f_ref->name);
            }
            return $stripped;
        endif;
        // Other field types contribute no text.
        return '';
    }
Alessio Dal Bianco (Author) Posted November 26, 2013
Hi jdiderick, thank you for the addition! I will release a new version soon with your code and some other improvements. ADB
Alessio Dal Bianco (Author) Posted December 17, 2013
Hi all, I have updated the module. It includes some fixes for repeater fields plus Diderik's addition (thank you!). ADB
ceberlin Posted March 30, 2014
Hi Alessio, I am currently evaluating the module for a project. This project has multiple languages. My questions:
Right now the stop-word list is hardcoded in the module's folder. If I add or change anything there, it would be lost after an update of the module. Correct? Isn't assets a better place for the stop words? Or the database?
What do I do if I am in a multilingual environment? How can I set stop words per language? Can I at all?
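One way the per-language question could be approached is sketched below. It assumes a hypothetical layout that the module does not currently use: one plain-text file per language, named `stopwords-<code>.txt` with one word per line, kept in a directory outside the module folder (for example somewhere under the site's assets) so that a module update would not overwrite it. The function name `loadStopWords` is also hypothetical.

```php
<?php
// Hypothetical sketch: load stop words for a language from files such as
// <dir>/stopwords-en.txt. Neither the file layout nor the function is part
// of the Indexer module.
function loadStopWords(string $dir, string $lang, string $fallback = 'en'): array
{
    // Try the requested language first, then the fallback language.
    foreach ([$lang, $fallback] as $code) {
        $file = rtrim($dir, '/') . "/stopwords-$code.txt";
        if (is_readable($file)) {
            $lines = file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
            // Trim whitespace, drop blank entries, remove duplicates.
            $words = array_filter(array_map('trim', $lines), 'strlen');
            return array_values(array_unique($words));
        }
    }

    // No list found: index without stop-word filtering.
    return [];
}
```

Returning an empty list when no file exists means an unconfigured language is simply indexed without stop-word removal, rather than being filtered with another language's words unless the fallback file is present.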
thetuningspoon Posted November 7, 2014
Hi Alessio, thank you for your work on this module. It looks like it's just what I need on an upcoming project, and accomplishes it in the same way I was contemplating doing it. One question: Will this index Excel files as well as PDFs and Word files?
HerTha Posted January 17
The Site Indexer module has been in use on one of our websites for quite a while. It has worked very well for us for extracting text from PDFs. I just started to review the site in order to make it work with PHP 8.1. It turned out that some adjustments would be necessary in a few places, including the Indexer module. My PHP skills are quite limited, but the module author @Alessio Dal Bianco was kind enough to give me a helping hand (thanks Alessio!). The additions from @DaveP's fork have also been incorporated. The result can be accessed on GitHub (v0.8.3 branch): https://github.com/USSliberty/Processwire-site-indexer/tree/v0.8.3 Happy Wireing!
Alessio Dal Bianco (Author) Posted January 18
Hello everyone! As @HerTha wrote, I have picked up my very old module again in order to make it compatible with PHP 8.1, and I've done the bare minimum (for now) to make it work. I have not created an official release yet because:
In order not to break things for current installations, I still need to adapt and test the new PHP class that @DaveP introduced in his own fork: LukeMadhanga\Pdf2Text::pdf2txt($filefullpath); I suspect that it was already included in his composer setup or something.
The module, as you know, is quite old. I tested it with the latest version of ProcessWire some days ago and it seems to work fine overall. If you have any suggestions on the module structure, let me know here: https://github.com/USSliberty/Processwire-site-indexer/pull/4
I am curious whether there are any metrics on how many installations/downloads my module has. I have seen that over the years many (better) alternatives have popped up, so in the end I would like to know whether any of you still need support for it in terms of features and/or bugs, or whether I can just release this minor version and mark the module as deprecated or something else.
Alessio
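A common way to avoid breaking installations that do not have the extra class available is to check for it before calling it. The sketch below is only an illustration of that idea, not the module's code: the wrapper `extractPdfText` and the `$legacyExtractor` callback are hypothetical, and it assumes that `LukeMadhanga\Pdf2Text::pdf2txt()` returns the extracted text as a string, which has not been verified here.

```php
<?php
// Hypothetical wrapper, not part of the Indexer module.
// Uses the class from the fork when it can be loaded, otherwise falls back
// to whatever extraction routine the installation already relies on.
function extractPdfText(string $path, callable $legacyExtractor): string
{
    // class_exists() also triggers autoloading, so a Composer-installed
    // class would be found here if an autoloader is registered.
    if (class_exists('LukeMadhanga\\Pdf2Text')) {
        // Assumption: pdf2txt() returns the text content of the PDF.
        return (string) \LukeMadhanga\Pdf2Text::pdf2txt($path);
    }

    // Class not available: keep the previous behaviour.
    return (string) $legacyExtractor($path);
}
```

On a site without the class, a call such as `extractPdfText($file, $oldRoutine)` simply runs the old routine, so existing installations would behave as before.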