Posted

Hi all!

I have created a new module that improves the current search engine in PW:

https://github.com/USSliberty/Processwire-site-indexer (Beta version)

The main idea is to create a hidden field that stores keywords (separated by spaces). The keywords are generated automatically from all text fields on the page, plus PDF and DOC files.

So if you create new text fields, you no longer have to remember to add them to the Search Page module.

The only thing to do after installing is to change the list of fields in the Search Page (see attachment). In fact, you only need to search in the "indexer" field.

post-834-0-54191500-1368028283_thumb.png
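For readers who want to see the keyword-field idea in concrete terms, here is a minimal sketch. It is written in Python only for brevity (the module itself is PHP), and the function name `build_index`, the tokenizing rule, and the de-duplication step are assumptions for illustration, not the module's actual code.

```python
import re


def build_index(text_fields):
    """Return a space-separated keyword string built from a page's text fields.

    text_fields is a hypothetical dict mapping field name -> field text.
    """
    words = []
    seen = set()
    for value in text_fields.values():
        # Lowercase the text and split it into runs of word characters.
        for word in re.findall(r"\w+", value.lower()):
            # Keep each keyword once, in order of first appearance (assumption).
            if word not in seen:
                seen.add(word)
                words.append(word)
    return " ".join(words)


page = {"title": "Annual Report", "body": "The annual report for 2013."}
print(build_index(page))  # annual report the for 2013
```

The resulting string is what would be stored in the hidden "indexer" field, so a search only has to look at that one field.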

NOTE 1: At this time the module indexes a page only when you save it. In the next week I may add a complete-site re-index.

NOTE 2: The files are indexed with 2 Unix packages (poppler-utils & wv). I have tried pure PHP classes without success, but if you know a class that works fine I can add it to the module.
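As a rough illustration of how the two packages can be driven from a script, the sketch below picks `pdftotext` (from poppler-utils) for PDFs and `wvText` (from wv) for DOC files, each called as `tool input output`. It is in Python rather than PHP, the helper names are hypothetical, and how the module really invokes these tools is not shown in this thread.

```python
import os
import subprocess
import tempfile


def extraction_command(path, out_path):
    """Build the command line that converts `path` to plain text in `out_path`."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".pdf":
        return ["pdftotext", path, out_path]  # provided by poppler-utils
    if ext == ".doc":
        return ["wvText", path, out_path]  # provided by wv
    raise ValueError("unsupported file type: " + ext)


def extract_text(path):
    """Run the external tool and return the extracted text.

    Requires the matching package to be installed on the server.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_path = os.path.join(tmp, "out.txt")
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(extraction_command(path, out_path), check=True)
        with open(out_path, encoding="utf-8", errors="replace") as fh:
            return fh.read()
```

Building the command as a list (instead of a shell string) avoids quoting problems with file names that contain spaces.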

ADB

Posted

Looking forward to checking this out - thanks!

I have been using the attached set of functions for a long time to extract text from PDFs. Probably not as powerful as poppler, but might do what you need. I made some poorly documented changes to the original. Anyway, maybe you'll find something useful in there.

pdf2txt.php

Posted

Sorry you had no luck with that class. It has been working well for me - several hundred PDFs with no failures so far.

I have attached one that definitely works so maybe you can figure out what the issue might be.

EDIT: Notice that in the main function I changed it so it always uses the handleV2 function. The V3 one wasn't working for me, but you might want to look into that some more.

ian_newsletter_405 (2).pdf

Posted

That is weird. I looked back at my php source downloads and I was running 5.3.3 at some point back in 2010 and that script was still working then. It is strange that none of those options are working. Let me know if there is any testing I can do at my end for you that might help.

When I get a minute or two I might try implementing that class into your module and see how it works for me.

Posted

Just installed the module for the first time and received this error:

Notice: Undefined index: maxlength in /xxx/site/modules/Indexer/Indexer.module on line 185

Posted

Ok, I just tried integrating that original pdf2txt script I sent you into your module and it always returns nothing. However if I place the attached version in a web accessible location, edit the last line to point to a PDF file, it works perfectly (albeit with lots of non-fatal php errors that should be dealt with at some point).

Can you try this and see if it works for you at least?

pdf2txt.php

Posted (edited)
It's true, it works on its own, but if I integrate it into my module it returns nothing. Very weird...

Edited by Alessio Dal Bianco
Posted

Ok, I have updated Git and the Modules directory here (0.3.0 now)

• Now you can optionally select to use the PHP class or poppler.

• On uninstall, I now remove the indexer field.

• More normalization of the text stored in the indexer field.

USSliberty

Posted

Awesome, I'll be using this with a couple of projects that are just winding up. I'll let you know if I come across any issues. If I find time I might try to go through and clean up that class too - there is a fair bit of unneeded code in there and lots of undefined variables.

Thanks for your hard work on this.

  • 1 month later...
  • 2 weeks later...
Posted

Hi Alessio,

when I try to access the modules configuration page I get the following error:

Error: Call to undefined function _() (line 84 of C:\Work\sgh\website\htdocs\site\modules\Processwire-site-indexer-master\Indexer.module) 

Am I missing some dependencies here?

Posted

Hi Timo,

the _() function is an alias of the gettext() function (see here: http://php.net/manual/en/function.gettext.php).

Maybe you have not enabled the gettext extension?

Posted

Oh, I didn't know it's a default PHP function.

I enabled the module in my php.ini, now everything works fine.

Thanks for the help! 

Posted

This is perhaps an obvious feature creep suggestion but you might want to add the option to remove "noise words" like a, the, of, etc. There are lists on the net. These are words that are too common to be meaningful keywords.
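A minimal sketch of what such a filter could look like, written in Python for brevity. The word list here is a tiny placeholder and the function name is hypothetical; a real list would come from one of the published stop-word lists.

```python
# Tiny placeholder list; published stop-word lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "and", "or", "to", "in"}


def remove_noise_words(keywords, stop_words=STOP_WORDS):
    """Drop common words from a space-separated keyword string."""
    return " ".join(w for w in keywords.split() if w.lower() not in stop_words)


print(remove_noise_words("the annual report of 2013"))  # annual report 2013
```

Applied just before the keywords are saved, this would also keep the indexer field smaller.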
