Find the most common non-stop words in a column

Question

Pot*_*ato 7 postgresql full-text-search string-manipulation

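In short, I have a Postgres column containing ordinary prose, and I want to determine the x most commonly used words across all rows (a "word" being a group of characters delimited by spaces that is not a stop word).

I have found two solutions that nearly hit the mark: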

SELECT *                                       
FROM   ts_stat($$SELECT to_tsvector('english', title) FROM item$$) 
ORDER  BY ndoc DESC
LIMIT  50;

This is great, except that it returns word stems.

SELECT   UNNEST(string_to_array(title, ' ')) AS word, COUNT(*) AS ct
FROM     item 
GROUP    BY 1 
ORDER    BY 2 DESC
LIMIT    50;

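This one returns full words, but includes stop words.

For simplicity: the stop words are supposed to be found in TABLE stop_words (lowercase_stopword text PRIMARY KEY).

Can anyone help me get this over the line?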

Answer 1

Erw*_*ter 6

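Your first query was pretty close. To get rid of the unwanted stemming, create a text search configuration with a simple dictionary, which does not stem.

I suggest using a separate schema for text search objects, but that is entirely optional: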

CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';

CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
    TEMPLATE = pg_catalog.simple
  , STOPWORDS = english
);

CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER  TEXT SEARCH CONFIGURATION ts.english_simple
   ALTER MAPPING FOR asciiword WITH ts.english_simple_dict;  -- 1, 'Word, all ASCII'

Then your query works, and it is very fast, too:

SELECT *                                       
FROM   ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$) 
ORDER  BY ndoc DESC
LIMIT  50;

dbfiddle here

This operates on lowercased words, without stemming, and does not break on non-ASCII letters.
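
To see the difference in practice, here is a minimal sketch. It assumes the ts.english_simple configuration from above has been created; the sample sentence is made up for illustration. The first query should show stems (such as fox and run) in the first column and full lowercased words in the second. The second query is a variation on the one above: ts_stat() also returns nentry, the total number of occurrences, which may be closer to "most commonly used" than ndoc, the number of rows containing the word.

```sql
-- Compare the built-in 'english' configuration (stemming) with the
-- custom 'ts.english_simple' configuration (stop words removed, no stemming).
SELECT to_tsvector('english', 'The quick foxes are running')           AS stemmed
     , to_tsvector('ts.english_simple', 'The quick foxes are running') AS unstemmed;

-- Rank by total occurrences (nentry) instead of by number of rows (ndoc).
SELECT word, ndoc, nentry
FROM   ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$)
ORDER  BY nentry DESC, ndoc DESC, word
LIMIT  50;
```

Whether to rank by ndoc or nentry depends on whether repeated use within a single title should count more than once.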

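Background

Read the chapter "Simple Dictionary" in the manual.

The exact definition of a "word" is a tricky matter. The default text search parser (currently the only one) identifies 23 different types of tokens. See: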

SELECT * FROM ts_token_type('default');

Built-in text search configurations map most of these to (built-in) dictionaries. The mappings of the english configuration:

SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM   pg_ts_config_map m
LEFT   JOIN ts_token_type(3722) tt ON tt.tokid = m.maptokentype
WHERE  mapcfg = 'english'::regconfig  -- 'ts.english_simple'::regconfig
ORDER  BY tt.tokid;

The demo above creates a new configuration based on the simple configuration. Since all English stop words are of type "asciiword", we only need to remap this one token type to remove stop words, with no stemming or anything else.
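
To check which token type and dictionary handle each word, ts_debug() can help. The following is a small sketch, assuming ts.english_simple exists as defined above; the sample sentence is invented. One would expect stop words such as "the" to show up as asciiword tokens with an empty lexemes array, while a word with a non-ASCII letter such as "café" should be classified as a different token type and still pass through the plain simple dictionary.

```sql
-- Show, per token: its type (alias), the dictionaries mapped to that type,
-- and the resulting lexemes (empty array for a stop word, NULL if the
-- token type is not mapped at all).
SELECT alias, token, dictionaries, lexemes
FROM   ts_debug('ts.english_simple', 'The café is closed on Mondays');
```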

Answer 2

joa*_*olo 1

This will give you the expected output:

-- Some example data
WITH titles(title) AS
(
   VALUES 
      ('This is a title'), 
      ('This is another title'), 
      ('This is finally a third title'), 
      ('and I don''t like Mondays')
)

-- List here all the words that you consider 'stop words'
-- in lowercase
, stop_words(word) AS
(

    VALUES ('the'), ('a'), ('and')
)

-- Make list of (lowercased) found words
, found_lower_words AS
(
SELECT 
    lower(unnest(string_to_array(title, ' '))) AS word
FROM
    titles
)

-- And now anti-join with the stop_words, group and count
SELECT
    word, count(*) AS word_count
FROM
    found_lower_words
    LEFT JOIN stop_words USING(word)
WHERE
    stop_words.word is NULL
GROUP BY
    word
ORDER BY
    word_count DESC, word ASC
LIMIT
    50 ;

The result will be:

  |---------+---|
  |   is    | 3 |
  |---------+---|
  |  this   | 3 |
  |---------+---|
  |  title  | 3 |
  |---------+---|
  | another | 1 |
  |---------+---|
  |  don't  | 1 |
  |---------+---|
  | finally | 1 |
  |---------+---|
  |    i    | 1 |
  |---------+---|
  |  like   | 1 |
  |---------+---|
  | mondays | 1 |
  |---------+---|
  |  third  | 1 |
  |---------+---|

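The same anti-join idea can be sketched against the tables named in the question. This assumes item(title text) and stop_words(lowercase_stopword text PRIMARY KEY) exist as described there, and it uses NOT EXISTS instead of LEFT JOIN ... IS NULL, which is only a stylistic choice. Like the query above, it splits on single spaces, so punctuation stays attached to words.

```sql
-- Assumed tables (from the question): item(title), stop_words(lowercase_stopword)
SELECT w.word, count(*) AS word_count
FROM   item i
CROSS  JOIN LATERAL unnest(string_to_array(lower(i.title), ' ')) AS w(word)
WHERE  w.word <> ''                        -- skip empty strings from repeated spaces
AND    NOT EXISTS (                        -- anti-join: keep only non-stop words
          SELECT 1
          FROM   stop_words s
          WHERE  s.lowercase_stopword = w.word
       )
GROUP  BY w.word
ORDER  BY word_count DESC, w.word
LIMIT  50;
```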