
word frequency from text fields and textareas


tires

Hi!

Is there a module or a best practice for finding the most frequent words used in the text fields and textarea fields of pages (a few hundred)?

I want to build some kind of "word" cloud with these words.

Thank you and have a nice day!


Hello,

interesting topic. Here is one example (please read comments inside script):

// !!! configuration
$desired_templates = array('basic-page');
$desired_fields = array('body', 'summary');
$stop_words = array('and', 'is', 'for', 'a', 'the', 'to', 'of');
$limit_list = 10;

// most used words list
$words = array();

// target templates where we search 
$selector = implode("|", $desired_templates);

// get all desired pages
$content = $pages->find("template=$selector");

// fill words array
foreach($content as $item){
	foreach($desired_fields as $f){
		$words[] = $item->{$f};
	}  
}

// https://stackoverflow.com/questions/3175390/most-used-words-in-text-with-php
function most_frequent_words($string, $stop_words = [], $limit = 5) {
    $string = strtolower($string); // Make the string lowercase

    $words = str_word_count($string, 1); // Return an array of all words found in the string
    $words = array_diff($words, $stop_words); // Remove stop words from the array
    $words = array_count_values($words); // Count the occurrences of each word

    arsort($words); // Sort by count, highest first

    return array_slice($words, 0, $limit); // Keep only the top $limit words
}

// output
var_dump(most_frequent_words(implode(' ', $words), $stop_words, $limit_list));

The downside of this example is that running the query on every front-end request is expensive, so it is better to place it in a dedicated template and use the PW cache.
Alternatively, maybe the best option is to create an Ajax-driven module in the backend and generate the word list there.
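
To illustrate the caching idea, here is a minimal sketch that stores the computed word list in a JSON file and reuses it until it expires. The helper name `cached_result`, the cache file path and the lifetime are assumptions made for this example. It is a generic stand-in, not the PW cache itself, which would normally take over this role.

```php
<?php
// Hypothetical helper: reuse a stored result until it is older than $ttl seconds.
function cached_result(string $file, int $ttl, callable $compute) {
    if (is_file($file) && time() - filemtime($file) < $ttl) {
        $data = json_decode((string) file_get_contents($file), true);
        if (is_array($data)) {
            return $data; // Cache hit: skip the expensive work
        }
    }
    $data = $compute();                                    // Cache miss: do the work once
    file_put_contents($file, json_encode($data), LOCK_EX); // Store for later requests
    return $data;
}
```

The closure passed as `$compute` would wrap the page query and the call to `most_frequent_words()`, so the pages are only loaded when the stored list is missing or stale.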

Regards.


WOW!!! Thanks a lot!!!
Works perfectly!

I just changed the code for output:

// output
$words = most_frequent_words(implode(' ', $words), $stop_words, $limit_list);
foreach ($words as $word => $value) {
   echo $word."(".$value.")";
}

And I used this German stop word list: https://github.com/stopwords-iso/stopwords-de

Some kind of "page statistics" module (used words, number of pages, number of comments, last edited pages, etc.) would be a great idea.
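
For the word cloud mentioned in the first post, the word/count array returned by `most_frequent_words()` could be turned into markup where more frequent words get a larger font. The following is only a sketch: the function name `word_cloud_html` and the pixel sizes are assumptions chosen for illustration.

```php
<?php
// Hypothetical helper: turn a word => count array into HTML spans sized by frequency.
function word_cloud_html(array $counts, int $min_px = 12, int $max_px = 36): string {
    if (!$counts) {
        return '';
    }
    $lo = min($counts);
    $hi = max($counts);
    $range = max($hi - $lo, 1); // Avoid division by zero when all counts are equal

    $spans = [];
    foreach ($counts as $word => $count) {
        // Linear interpolation between the smallest and the largest font size
        $size = (int) round($min_px + ($count - $lo) / $range * ($max_px - $min_px));
        $label = htmlspecialchars((string) $word, ENT_QUOTES, 'UTF-8');
        $spans[] = "<span style=\"font-size:{$size}px\" title=\"{$count}\">{$label}</span>";
    }
    return implode(' ', $spans);
}
```

With the variables from the snippet above, `echo word_cloud_html($words);` would replace the `foreach` loop. Linear scaling is the simplest choice. If a few words dominate, scaling by the logarithm of the count may look more balanced.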
 


For some strange reason, the script outputs two words for the German word "präsident": "pr" and "sident".

Probably the script stops at "ä".
But the text in the fields isn't encoded that way.

Or is there a problem with "ä" etc.?

Any ideas???
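
A likely cause is that `str_word_count()` and `strtolower()` work on single bytes and are not multibyte-aware. In UTF-8 an "ä" takes two bytes, neither of which counts as a letter, so the word is split there. A possible workaround, sketched below, is to lowercase with `mb_strtolower()` and extract words with a Unicode-aware regular expression. The function name `most_frequent_words_utf8` is made up for this example, and the sketch assumes that the field content is valid UTF-8 and that the mbstring extension is available.

```php
<?php
// Hypothetical UTF-8-aware variant of most_frequent_words(); requires the mbstring extension.
function most_frequent_words_utf8(string $text, array $stop_words = [], int $limit = 5): array {
    // Lowercase with multibyte support so "Präsident" becomes "präsident"
    $text = mb_strtolower($text, 'UTF-8');

    // Match runs of Unicode letters; the /u modifier makes PCRE treat the input as UTF-8
    preg_match_all('/\p{L}+/u', $text, $matches);
    $words = $matches[0] ?? []; // Empty if matching failed, e.g. on invalid UTF-8

    $words = array_diff($words, $stop_words); // Remove stop words
    $counts = array_count_values($words);      // Count the occurrences of each word
    arsort($counts);                           // Sort by count, highest first

    return array_slice($counts, 0, $limit, true); // Keep only the top $limit words
}
```

Unlike `str_word_count()`, this pattern also splits words at apostrophes and hyphens, and the stop word list has to be lowercase UTF-8 for the comparison to match.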

