Posted

Hi!

Is there a module or a best practice for finding the most frequent words used in the text and textarea fields of pages (a few hundred)?

I want to build some kind of word cloud with these words.

Thank you and have a nice day!

Posted

Hello,

Interesting topic. Here is one example (please read the comments inside the script):

// !!! configuration
$desired_templates = array('basic-page');
$desired_fields = array('body', 'summary');
$stop_words = array('and', 'is', 'for', 'a', 'the', 'to', 'of');
$limit_list = 10;

// most used words list
$words = array();

// target templates where we search 
$selector = implode("|", $desired_templates);

// get all desired pages
$content = $pages->find("template=$selector");

// fill words array
foreach($content as $item){
	foreach($desired_fields as $f){
		$words[] = $item->{$f};
	}  
}

// https://stackoverflow.com/questions/3175390/most-used-words-in-text-with-php
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make the string lowercase

    $words = str_word_count($string, 1); // Return an array of all words found in the string
    $words = array_diff($words, $stop_words); // Remove stop words from the array
    $words = array_count_values($words); // Count the occurrences of each word

    arsort($words); // Sort by count, highest first

    return array_slice($words, 0, $limit); // Keep only the top $limit words
}

// output
var_dump(most_frequent_words(implode(' ', $words), $stop_words, $limit_list));

The downside of this example is that the query runs at front-end runtime on every request; it is better to place it in a dedicated template and use the PW cache.
Also, the best option may be to create an Ajax-driven module in the back end and generate the word list there.
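
The caching idea can be illustrated with a minimal sketch in plain PHP. This is a stand-in for the concept, not ProcessWire's own cache API (in ProcessWire the `$cache` API variable would be the natural place for this). The helper name `cached_result()`, the cache file location and the one-hour lifetime are assumptions made for illustration.

```php
<?php
// Hypothetical helper (not part of ProcessWire): return the cached value
// if the cache file is fresh, otherwise rebuild it and store the result.
function cached_result(string $file, int $max_age, callable $build): array
{
    if (is_file($file) && (time() - filemtime($file)) < $max_age) {
        $data = json_decode(file_get_contents($file), true);
        if (is_array($data)) {
            return $data; // cache hit
        }
    }

    $data = $build(); // cache miss: do the expensive work once
    file_put_contents($file, json_encode($data), LOCK_EX);

    return $data;
}

// Assumed cache location and lifetime (one hour).
$cache_file = sys_get_temp_dir() . '/word-cloud-cache.json';

$top_words = cached_result($cache_file, 3600, function () {
    // In the script above this would be the $pages->find() loop followed by
    // most_frequent_words(); a fixed array stands in for that result here.
    return ['page' => 12, 'module' => 7];
});

print_r($top_words);
```

With something like this in place, the page query and word counting would run at most once per hour instead of on every request.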

Regards.

Posted

WOW!!! Thanks a lot!!!
Works perfectly!

I just changed the output code:

// output
$words = most_frequent_words(implode(' ', $words), $stop_words, $limit_list);
foreach ($words as $word => $value) {
   echo $word."(".$value.")";
}
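
For the word cloud itself, the word/count pairs could be mapped to font sizes. The following is only a sketch: the function name `render_word_cloud()`, the pixel range and the sample counts are assumptions, not something from the script above.

```php
<?php
// Hypothetical helper: render word => count pairs as <span> elements whose
// font size grows linearly with the count.
function render_word_cloud(array $counts, int $min_px = 12, int $max_px = 36): string
{
    if ($counts === []) {
        return '';
    }

    $low = min($counts);
    $high = max($counts);
    $range = max($high - $low, 1); // avoid division by zero when all counts are equal

    $html = '';
    foreach ($counts as $word => $count) {
        // Linear interpolation between $min_px and $max_px.
        $size = (int) round($min_px + ($count - $low) / $range * ($max_px - $min_px));
        $html .= sprintf(
            '<span style="font-size:%dpx" title="%d">%s</span> ',
            $size,
            $count,
            htmlspecialchars((string) $word, ENT_QUOTES, 'UTF-8')
        );
    }

    return trim($html);
}

// Sample data standing in for the result of most_frequent_words().
echo render_word_cloud(['seite' => 9, 'modul' => 5, 'feld' => 1]), "\n";
```

The most frequent word gets the largest size and the least frequent the smallest; everything else is scaled in between.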

I also used this German stop word list: https://github.com/stopwords-iso/stopwords-de

Some kind of "page statistics" module (used words, number of pages, number of comments, last edited pages, etc.) would be a great idea.

Posted

For some strange reason the script outputs two words for the German word "präsident": "pr" and "sident".

The script probably stops at "ä".
But the text in the fields isn't encoded that way.

Or is there a general problem with "ä" and similar characters?

Any ideas???
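
A likely explanation: `str_word_count()` and `strtolower()` work on single bytes, while "ä" is stored as two bytes in UTF-8, so neither byte is recognised as a letter and the word is split there. A possible UTF-8-aware variant is sketched below. It assumes the field content is UTF-8 and that the mbstring extension is available; the function name `most_frequent_words_utf8()` is hypothetical.

```php
<?php
// Sketch of a UTF-8-aware variant of most_frequent_words().
// The name most_frequent_words_utf8 is an assumption, not from the thread.
function most_frequent_words_utf8(string $text, array $stop_words = [], int $limit = 5): array
{
    // mb_strtolower() also lowercases multibyte letters such as "Ä".
    $text = mb_strtolower($text, 'UTF-8');

    // Split on every run of characters that is not a Unicode letter.
    // The /u modifier makes PCRE read the input as UTF-8 instead of bytes.
    // Unlike str_word_count(), this also splits at apostrophes and hyphens.
    $words = preg_split('/[^\p{L}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        return []; // preg_split() fails on invalid UTF-8 input
    }

    $words = array_diff($words, $stop_words); // remove stop words
    $counts = array_count_values($words);     // word => number of occurrences
    arsort($counts);                          // highest count first

    return array_slice($counts, 0, $limit, true);
}

// Sample sentence for illustration only.
print_r(most_frequent_words_utf8('Der Präsident trifft den PRÄSIDENT', ['der', 'den']));
```

With this variant "präsident" should be counted as one word. The stop word list needs to be lowercase UTF-8 as well for the comparison to match.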
