Alessio Dal Bianco Posted May 8, 2013
Hi all! I have created a new module that improves the current search engine in PW: https://github.com/USSliberty/Processwire-site-indexer (beta version).
The main idea is to create a hidden field that stores keywords (separated by spaces). The keywords are generated automatically from all text fields on the page, plus PDF and DOC files. So if you create new text fields, you no longer have to remember to add them to the Search Page module. The only thing to do after installing is to change the list of fields in Search Page (see attachment); in fact, you only need to search the "indexer" field.
NOTE 1: At this time the module indexes a page only when you save it. In the next week I may add a complete-site re-index.
NOTE 2: Files are indexed with two Unix packages (poppler-utils and wv). I tried pure PHP classes without success, but if you know of a class that works well I can add it to the module.
ADB
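A minimal sketch of the keyword-field idea described above, in plain PHP. It is not the module's actual code: the function name `buildKeywordIndex`, the de-duplication step, and the sample field names are assumptions made for illustration, and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical helper: collapse a page's text fields into one space-separated keyword string.
function buildKeywordIndex(array $textFields): string
{
    $keywords = [];
    foreach ($textFields as $value) {
        // Replace markup with spaces, decode entities, then lowercase for case-insensitive matching.
        $plain = preg_replace('/<[^>]*>/', ' ', (string) $value);
        $plain = html_entity_decode($plain, ENT_QUOTES, 'UTF-8');
        $plain = mb_strtolower($plain, 'UTF-8');

        // Split on anything that is not a letter or a digit.
        $words = preg_split('/[^\p{L}\p{N}]+/u', $plain, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            $keywords[$word] = true; // array keys de-duplicate repeated words
        }
    }
    return implode(' ', array_keys($keywords));
}

// Sample input standing in for a page's text fields.
echo buildKeywordIndex([
    'title' => 'Local strategies',
    'body'  => '<p>Strategies to reduce climate risk</p>',
]);
```

The resulting string is what would be stored in the hidden "indexer" field, so the search only has to look at that one field.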
adrian Posted May 8, 2013
Looking forward to checking this out - thanks! I have been using the attached set of functions for a long time to extract text from PDFs. Probably not as powerful as poppler, but it might do what you need. I made some poorly documented changes to the original. Anyway, maybe you'll find something useful in there.
Attachment: pdf2txt.php
Alessio Dal Bianco (Author) Posted May 9, 2013
Thank you! I will try it. I only need to convert files to plain text; if I can do this without any other package, that will be great!
ryan Posted May 10, 2013
Looks very cool Alessio! I look forward to seeing this in the modules directory.
Alessio Dal Bianco (Author) Posted May 12, 2013
Added! @adrian, I tried your class but it doesn't return any text. Do you have some PDFs that you know work?
adrian Posted May 12, 2013
Sorry you had no luck with that class. It has been working well for me - several hundred PDFs with no failures so far. I have attached one that definitely works, so maybe you can figure out what the issue might be.
EDIT: Notice that in the main function I changed it so it always uses the handleV2 function. The V3 one wasn't working for me, but you might want to look into that some more.
Attachment: ian_newsletter_405 (2).pdf
Alessio Dal Bianco (Author) Posted May 13, 2013
Thank you adrian, I'm looking into it now.
adrian Posted May 13, 2013
I haven't used this, but I went looking for other options and found this: http://pastebin.com/hRviHKp1 Might be worth trying.
Alessio Dal Bianco (Author) Posted May 13, 2013
I have tried both, and also this one from PHPClasses: http://www.phpclasses.org/browse/file/31030.html That one returns only "Local strategies to reduce climate risk", which is the caption of the first image. At this point I suspect that my PHP version has some problem (PHP/5.3.3-7+squeeze9)...
adrian Posted May 13, 2013
That is weird. I looked back at my PHP source downloads and I was running 5.3.3 at some point back in 2010, and that script was still working then. It is strange that none of those options are working. Let me know if there is any testing I can do at my end that might help. When I get a minute or two I might try implementing that class in your module and see how it works for me.
Alessio Dal Bianco (Author) Posted May 13, 2013
If you can print $chunk["data"] and $data after this line, it may be useful:
$data = gzuncompress($chunk["data"]);
Thank you!
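The check requested above could be done with a small diagnostic like the following sketch. `inspectChunk` is a hypothetical helper, not part of pdf2txt.php, and the sample chunk is produced with `gzcompress()` as a stand-in for a real PDF stream. It reports the raw bytes and whether `gzuncompress()` succeeded, which helps tell a stream that is not zlib-compressed from one that was corrupted before decompression.

```php
<?php
// Hypothetical debugging helper, not part of pdf2txt.php.
function inspectChunk(string $raw): array
{
    // gzuncompress() returns false and raises a warning on invalid input;
    // "@" hides the warning so the failure can be reported in the result instead.
    $data = @gzuncompress($raw);

    return [
        'raw_length' => strlen($raw),
        'raw_head'   => bin2hex(substr($raw, 0, 2)), // zlib data usually begins with byte 78
        'inflated'   => $data !== false,
        'data'       => $data === false ? null : $data,
    ];
}

// Stand-in for a real PDF stream: zlib-compressed page content.
$chunk = ['data' => gzcompress('BT (Local strategies) Tj ET')];
print_r(inspectChunk($chunk['data']));

// Bytes that are not zlib data are reported as not inflated.
print_r(inspectChunk('plain, uncompressed bytes'));
```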
adrian Posted May 13, 2013
Just installed the module for the first time and received this error:
Notice: Undefined index: maxlength in /xxx/site/modules/Indexer/Indexer.module on line 185
Alessio Dal Bianco (Author) Posted May 13, 2013
That's true, I hadn't noticed that because I haven't enabled debug mode in PW. It is now fixed on GIT!
adrian Posted May 14, 2013
Ok, I just tried integrating that original pdf2txt script I sent you into your module and it always returns nothing. However, if I place the attached version in a web-accessible location and edit the last line to point to a PDF file, it works perfectly (albeit with lots of non-fatal PHP errors that should be dealt with at some point). Can you try this and see if it works for you at least?
Attachment: pdf2txt.php
adrian Posted May 14, 2013
On another note - just noticed that the uninstaller routine does not remove the indexer field.
Alessio Dal Bianco (Author) Posted May 14, 2013 (edited May 14, 2013)
Quote (adrian): "Ok, I just tried integrating that original pdf2txt script I sent you into your module and it always returns nothing. However if I place the attached version in a web accessible location, edit the last line to point to a PDF file, it works perfectly (albeit with lots of non-fatal php errors that should be dealt with at some point). Can you try this and see if it works for you at least?"
It's true: it works on its own, but when I integrate it into my module it returns nothing. Very weird...
Alessio Dal Bianco (Author) Posted May 14, 2013
Yuhuu! Now it works! I think some variables were overwritten somewhere. Now I will prepare a new version of the module with this class. Thank you adrian for the help!
PS: I have updated the credits.
Attachment: Pdf2txt.php
Alessio Dal Bianco (Author) Posted May 14, 2013
Ok, I have updated GIT and the Modules directory (0.3.0 now):
• You can now optionally choose between the PHP class and poppler.
• Uninstalling the module now removes the indexer field.
• More normalization of the text stored in the indexer field.
USSliberty
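For the poppler option mentioned above, calling the `pdftotext` tool from poppler-utils could look like the sketch below. The helper names `pdfToTextCommand` and `extractPdfText` are hypothetical and this is not necessarily how the module invokes poppler; it assumes `pdftotext` is on the server's PATH and that `shell_exec()` is not disabled.

```php
<?php
// Hypothetical helper: build the shell command that makes pdftotext print a PDF's text to stdout.
function pdfToTextCommand(string $pdfPath): string
{
    // escapeshellarg() quotes the path safely; the trailing "-" tells pdftotext
    // to write the extracted text to stdout instead of a .txt file.
    return 'pdftotext ' . escapeshellarg($pdfPath) . ' -';
}

// Hypothetical helper: run the command and return the extracted text ('' on failure).
function extractPdfText(string $pdfPath): string
{
    $output = shell_exec(pdfToTextCommand($pdfPath));
    // shell_exec() returns null or false when nothing was produced or the call failed.
    return is_string($output) ? $output : '';
}

// The path below is only an example.
echo pdfToTextCommand('/tmp/annual report.pdf'), "\n";
```

A module offering both options could fall back to the pure PHP class whenever `extractPdfText()` returns an empty string.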
adrian Posted May 15, 2013
Awesome, I'll be using this with a couple of projects that are just winding up. I'll let you know if I come across any issues. If I find time I might try to go through and clean up that class too - there is a fair bit of unneeded code in there and lots of undefined variables. Thanks for your hard work on this.
Alessio Dal Bianco (Author) Posted May 16, 2013
I have noticed that it is better to use "%=" than "~=" with my module. Additionally, I was wondering whether I can change the default search operator programmatically...
Alessio Dal Bianco (Author) Posted July 8, 2013
Hi all, I have updated the module. Now you can reindex all pages at once!
Timo Posted July 22, 2013
Hi Alessio, when I try to access the module's configuration page I get the following error:
Error: Call to undefined function _() (line 84 of C:\Work\sgh\website\htdocs\site\modules\Processwire-site-indexer-master\Indexer.module)
Am I missing some dependencies here?
Alessio Dal Bianco (Author) Posted July 22, 2013
Quote (Timo): "Hi Alessio, when I try to access the modules configuration page I get the following error: Error: Call to undefined function _() (line 84 of C:\Work\sgh\website\htdocs\site\modules\Processwire-site-indexer-master\Indexer.module) Am I missing some dependencies here?"
Hi Timo, the _() function is an alias of the gettext() function (see http://php.net/manual/en/function.gettext.php). Maybe you have not enabled the gettext extension?
Timo Posted July 22, 2013
Oh, I didn't know it's a default PHP function. I enabled the module in my php.ini, now everything works fine. Thanks for the help!
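Where the gettext extension cannot be enabled, one possible workaround is a guard like the sketch below, loaded before the module file. This is an assumption for illustration and not something the module ships: it defines `_()` only when PHP does not already provide it and simply returns the message untranslated. The label 'Indexer settings' is a made-up example string.

```php
<?php
// Only define the fallback when the gettext extension has not provided _().
if (!function_exists('_')) {
    // Fallback: return the message as-is, i.e. untranslated.
    function _($message)
    {
        return $message;
    }
}

// With or without gettext loaded, this call no longer triggers
// "Call to undefined function _()".
echo _('Indexer settings'), "\n";
```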
SteveB Posted July 22, 2013
This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.
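The noise-word suggestion above could be sketched as a filter applied to the keyword string before it is saved. `removeNoiseWords` is a hypothetical function and the three-word list is only a placeholder; a real list would come from one of the published stop-word lists.

```php
<?php
// Hypothetical helper: drop common "noise words" from a space-separated keyword string.
function removeNoiseWords(string $keywords, array $noiseWords): string
{
    // Flip the list so each lookup is a cheap isset() on array keys.
    $noise = array_flip(array_map('strtolower', $noiseWords));

    $kept = [];
    foreach (preg_split('/\s+/', $keywords, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        if (!isset($noise[strtolower($word)])) {
            $kept[] = $word;
        }
    }
    return implode(' ', $kept);
}

// Placeholder noise-word list for illustration only.
echo removeNoiseWords('the risk of a local flood', ['a', 'the', 'of']), "\n";
// prints "risk local flood"
```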