Module: Site indexer

Hi all!

I have created a new module that improves the current search engine in PW:

https://github.com/USSliberty/Processwire-site-indexer (Beta version)

The main idea is to create a hidden field that stores keywords (separated by spaces). The keywords are generated automatically from all text fields in the page, plus PDF and DOC files.

So if you create new text fields, you no longer need to remember to add them to the Search Page module.

The only thing to do after installing is to change the list of fields in the Search Page (see attachment). In fact, you only need to search in the "indexer" field.
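To illustrate the idea, here is a minimal sketch in Python. It is not the module's own PHP code; the function name `build_index` and the sample page are made up for the example, and the real module may tokenize differently.

```python
import re

def build_index(fields):
    """Collect the unique lowercase words from every text field into one string."""
    seen = set()
    keywords = []
    for value in fields.values():
        # \w+ matches runs of letters, digits and underscores.
        for word in re.findall(r"\w+", value.lower()):
            if word not in seen:
                seen.add(word)
                keywords.append(word)
    # The hidden field holds the keywords separated by spaces.
    return " ".join(keywords)

# Hypothetical page with two text fields.
page = {"title": "Site indexer", "body": "The indexer stores keywords for the site."}
index_field = build_index(page)
```

Searching then only has to look at `index_field`, no matter how many text fields the page has.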

post-834-0-54191500-1368028283_thumb.png

NOTE 1: At this time the module indexes a page only when you save it. In the next week I may add a complete-site re-index.

NOTE 2: Files are indexed with two Unix packages (poppler-utils and wv). I tried pure PHP classes without success, but if you know of a class that works well, I can add it to the module.
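For anyone curious how the two packages can be called, here is a rough Python sketch. It assumes the `pdftotext` command (from poppler-utils) and the `wvText` command (from wv) are on the PATH; the helper names `extraction_command` and `extract_text` are invented for the example, and the module itself may invoke the tools differently.

```python
import os
import subprocess
import tempfile

def extraction_command(path, out_path):
    """Return the command line that converts `path` to plain text, or None."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return ["pdftotext", path, out_path]  # poppler-utils
    if ext == ".doc":
        return ["wvText", path, out_path]     # wv
    return None  # unsupported file type

def extract_text(path):
    """Run the matching tool and return the extracted text ('' if unsupported)."""
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        cmd = extraction_command(path, out_path)
        if cmd is None:
            return ""
        # Raises CalledProcessError if the tool exits with an error.
        subprocess.run(cmd, check=True)
        with open(out_path, encoding="utf-8", errors="replace") as handle:
            return handle.read()
    finally:
        os.remove(out_path)
```

The extracted text can then be fed into the same keyword field as the page's text fields.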

ADB


Looking forward to checking this out - thanks!

I have been using the attached set of functions for a long time to extract text from PDFs. Probably not as powerful as poppler, but might do what you need. I made some poorly documented changes to the original. Anyway, maybe you'll find something useful in there.

pdf2txt.php


Sorry you had no luck with that class. It has been working well for me - several hundred PDFs with no failures so far.

I have attached one that definitely works so maybe you can figure out what the issue might be.

EDIT: Notice that in the main function I changed it so it always uses the handleV2 function. The V3 one wasn't working for me, but you might want to look into that some more.

ian_newsletter_405 (2).pdf


That is weird. I looked back at my php source downloads and I was running 5.3.3 at some point back in 2010 and that script was still working then. It is strange that none of those options are working. Let me know if there is any testing I can do at my end for you that might help.

When I get a minute or two I might try implementing that class into your module and see how it works for me.


Ok, I just tried integrating that original pdf2txt script I sent you into your module and it always returns nothing. However if I place the attached version in a web accessible location, edit the last line to point to a PDF file, it works perfectly (albeit with lots of non-fatal php errors that should be dealt with at some point).

Can you try this and see if it works for you at least?

pdf2txt.php


It's true: it works on its own, but when I integrate it into my module it returns nothing. Very weird...

Edited by Alessio Dal Bianco

Awesome, I'll be using this with a couple of projects that are just winding up. I'll let you know if I come across any issues. If I find time I might try to go through and clean up that class too - there is a fair bit of unneeded code in there and lots of undefined variables.

Thanks for your hard work on this.


  • 1 month later...
  • 2 weeks later...

Hi Alessio,

when I try to access the module's configuration page I get the following error:

Error: Call to undefined function _() (line 84 of C:\Work\sgh\website\htdocs\site\modules\Processwire-site-indexer-master\Indexer.module) 

Am I missing some dependencies here?

Hi Timo,

the _() function is an alias of the gettext() function (see here: http://php.net/manual/en/function.gettext.php).

Maybe you have not enabled the gettext extension?


This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.
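A quick Python sketch of what I mean, just to show the filtering step. The word list here is a tiny made-up sample (a real list from the net would be much longer), and `remove_noise_words` is a hypothetical helper, not something in the module.

```python
# Tiny illustrative sample; real noise-word lists are far longer.
NOISE_WORDS = frozenset({"a", "an", "and", "of", "the", "to", "in", "for"})

def remove_noise_words(keywords, noise_words=NOISE_WORDS):
    """Drop noise words from a space-separated keyword string."""
    return " ".join(w for w in keywords.split() if w.lower() not in noise_words)

cleaned = remove_noise_words("site indexer the stores keywords for")
```

Here `cleaned` ends up as "site indexer stores keywords", so the index keeps only the words that are worth searching for.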

