tires Posted July 30, 2018

Hi! Is there a module or a best practice for finding the most frequent words used in the text and textarea fields of pages (a few hundred)? I want to build some kind of word cloud from these words. Thank you and have a nice day!
Zeka Posted July 30, 2018

@tires I'm not sure there is anything built into PW for your needs, but you can easily create a custom module. As a starting point, you can use this example: https://github.com/benbalter/Frequency-Analysis/blob/master/frequency-analysis.php
OLSA Posted July 30, 2018

Hello, interesting topic. Here is one example (please read the comments inside the script):

    // !!! configuration
    $desired_templates = array('basic-page');
    $desired_fields = array('body', 'summary');
    $stop_words = array('and', 'is', 'for', 'a', 'the', 'to', 'of');
    $limit_list = 10; // length of the most-used-words list

    $words = array();

    // target templates where we search
    $selector = implode("|", $desired_templates);

    // get all desired pages
    $content = $pages->find("template=$selector");

    // fill words array with the raw field values
    foreach($content as $item){
        foreach($desired_fields as $f){
            $words[] = $item->{$f};
        }
    }

    // https://stackoverflow.com/questions/3175390/most-used-words-in-text-with-php
    function most_frequent_words($string, $stop_words = [], $limit = 5) {
        $string = strtolower($string); // Make string lowercase
        $words = str_word_count($string, 1); // Returns an array containing all the words found inside the string
        $words = array_diff($words, $stop_words); // Remove stop words from the array
        $words = array_count_values($words); // Count the number of occurrences
        arsort($words); // Sort by count, descending
        return array_slice($words, 0, $limit); // Limit the number of words and return the word array
    }

    // output
    var_dump(most_frequent_words(implode(' ', $words), $stop_words, $limit_list));

The drawback of this example shows up if you run the query at frontend runtime, on every page request. It is better to put it in a dedicated template and use the PW cache. Perhaps the best option is to create an Ajax-driven module in the backend and generate the word list there.

Regards.
tires (Author) Posted July 31, 2018

WOW! Thanks a lot! Works perfectly! I just changed the output code:

    // output
    $words = most_frequent_words(implode(' ', $words), $stop_words, $limit_list);
    foreach ($words as $word => $value) {
        echo $word."(".$value.")";
    }

And I used this German stop word list: https://github.com/stopwords-iso/stopwords-de

A "page statistics" module (words used, number of pages, number of comments, last edited pages, etc.) would be a great idea.
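Feeding such a stop word list into $stop_words could look roughly like the sketch below. It assumes the list has been downloaded to a local file as a JSON array of strings; the helper name load_stop_words() and the file location are hypothetical and not part of the posts above.

```php
<?php
// Hypothetical helper: read a JSON array of stop words from a local file.
// Returns an empty array if the file is missing or is not a JSON array.
function load_stop_words(string $path): array
{
    if (!is_readable($path)) {
        return [];
    }
    $list = json_decode(file_get_contents($path), true);
    if (!is_array($list)) {
        return [];
    }
    // Keep only non-empty strings and reindex the result.
    return array_values(array_filter($list, function ($word) {
        return is_string($word) && $word !== '';
    }));
}
```

The result can then replace the hard-coded array, for example $stop_words = load_stop_words(__DIR__ . '/stopwords-de.json'); where the file name is an assumption about how the downloaded list was saved.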
tires (Author) Posted July 31, 2018

For some strange reason the script outputs two words for the German word "präsident": "pr" and "sident". Probably the script stops at the "ä". But the text in the fields isn't encoded this way. Or is there a problem with "ä" etc.? Any ideas?
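A likely explanation, offered as a sketch rather than a confirmed diagnosis: str_word_count() works on bytes, not on UTF-8 characters. In UTF-8 the "ä" is stored as two bytes that the function does not treat as letters, so the word is cut at that point. One possible workaround is to match Unicode letters with a regular expression instead. The function name most_frequent_words_utf8() is hypothetical, and the code assumes that the text is UTF-8 and that the mbstring extension is available.

```php
<?php
// Hypothetical UTF-8-aware variant of most_frequent_words().
// Assumes UTF-8 input and the mbstring extension.
function most_frequent_words_utf8(string $text, array $stop_words = [], int $limit = 5): array
{
    // Lowercase with multibyte support so that "Ä" becomes "ä".
    $text = mb_strtolower($text, 'UTF-8');

    // Collect runs of Unicode letters (\p{L}) and combining marks (\p{M});
    // the "u" modifier makes the pattern operate on UTF-8 characters.
    preg_match_all('/[\p{L}\p{M}]+/u', $text, $matches);

    $words = array_diff($matches[0], $stop_words); // Remove stop words
    $counts = array_count_values($words);          // Count occurrences per word
    arsort($counts);                               // Sort by count, descending

    return array_slice($counts, 0, $limit, true);  // Keep the top $limit words
}
```

Unlike str_word_count(), this pattern also splits on apostrophes and hyphens, so "it's" is counted as "it" and "s"; those characters can be added to the character class if they should stay inside words.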