Most frequently used words in a text with PHP

use*_*est 3 php string stop-words word-frequency

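I found the code below on Stack Overflow, and it works well for finding the most common words in a string. But can I exclude common words such as "a", "if", "you", "have", etc. from the count? Or do I have to remove those elements after counting? How would I do that? Thanks in advance.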

<?php

$text = "A very nice to tot to text. Something nice to think about if you're into text.";


$words = str_word_count($text, 1); 

$frequency = array_count_values($words);

arsort($frequency);

echo '<pre>';
print_r($frequency);
echo '</pre>';
?>

Kha*_*led 9

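Here is a function that extracts the most common words from a string. It takes three parameters: the string, an array of stop words, and the number of keywords to return. Load the stop words from a text file into an array with PHP's file() function: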

$stop_words = file('stop_words.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

$this->extract_common_words($text, $stop_words);

You can use this stop_words.txt file as your main stop-word file, or create your own.
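If you create your own file, one word per line is the layout that file() with those two flags reads cleanly. Here is a minimal sketch, assuming a short made-up word list written to a temporary file (the path and the words are hypothetical, not the stop_words.txt mentioned above):

```php
<?php
// Sketch: build a stop-word file and load it. The temp file and the word
// list are hypothetical; one word per line is the layout assumed here.
$path = tempnam(sys_get_temp_dir(), 'stop_words_');
file_put_contents($path, "a\nif\n\nYou\nhave\n");

// FILE_IGNORE_NEW_LINES strips the trailing newline from each entry;
// FILE_SKIP_EMPTY_LINES drops the blank line in the middle of the file.
$stop_words = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);

// Normalise to lowercase so the list matches lowercased input text.
$stop_words = array_map('strtolower', $stop_words);

unlink($path); // clean up the temporary file
print_r($stop_words);
```

Lowercasing the list once at load time matters because the function lowercases the text before comparing, so an entry like "You" would otherwise never match.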

function extract_common_words($string, $stop_words, $max_count = 5) {
      $string = preg_replace('/\s\s+/', ' ', $string); // collapse runs of whitespace into a single space
      $string = trim($string); // trim the string
      $string = preg_replace('/[^a-zA-Z -]/', '', $string); // only take alphabet characters, but keep the spaces and dashes too…
      $string = strtolower($string); // make it lowercase

      preg_match_all('/\b.*?\b/i', $string, $match_words);
      $match_words = $match_words[0];

      foreach ( $match_words as $key => $item ) {
          if ( $item == '' || in_array(strtolower($item), $stop_words) || strlen($item) <= 3 ) {
              unset($match_words[$key]);
          }
      }  

      $word_count = str_word_count( implode(" ", $match_words) , 1); 
      $frequency = array_count_values($word_count);
      arsort($frequency);

      // keep only the $max_count most frequent words
      $keywords = array_slice($frequency, 0, $max_count);
      return $keywords;
}
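For short texts, the same idea (filter first, then count) can probably be sketched without the regex passes, by running array_diff() on the word list before array_count_values(). This is only a sketch under assumptions: the $stop_words array below is a small hypothetical sample, and unlike the function above it does not drop short words.

```php
<?php
// Sketch: drop stop words before counting. The $stop_words list is a small
// hypothetical sample, not the contents of any real stop_words.txt.
$text = "A very nice to tot to text. Something nice to think about if you're into text.";
$stop_words = ['a', 'if', 'you', "you're", 'have', 'to', 'about', 'into', 'very'];

// Lowercase first so "A" and "a" are treated as the same word.
$words = str_word_count(strtolower($text), 1);

// array_diff() keeps only the words that do not appear in $stop_words.
$filtered = array_diff($words, $stop_words);

$frequency = array_count_values($filtered);
arsort($frequency); // most frequent words first

print_r($frequency);
```

With this sample list, "nice" and "text" are each counted twice and none of the stop words appear in the result. Filtering before counting avoids having to delete keys from $frequency afterwards; if you prefer to count first, array_diff_key($frequency, array_flip($stop_words)) removes the same entries.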