Pot*_*ato 7 postgresql full-text-search string-manipulation
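In short, I have a Postgres column containing ordinary prose, and I would like to determine the x most commonly used words across all rows (a "word" being a group of characters delimited by a space, but not a stop word).
I have found two solutions that nearly hit the mark: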
SELECT *
FROM ts_stat($$SELECT to_tsvector('english', title) FROM item$$)
ORDER BY ndoc DESC
LIMIT 50;
This works well, except that it returns word stems.
SELECT UNNEST(string_to_array(title, ' ')) AS word, COUNT(*) AS ct
FROM item
GROUP BY 1
ORDER BY 2 DESC
LIMIT 50;
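This one returns full words, but includes stop words.
For simplicity: stop words are supposed to be found in TABLE stop_words (lowercase_stopword text PRIMARY KEY).
Could anyone help me get over the line?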
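Your first query was pretty close. To get rid of the unwanted stemming, create a text search configuration that uses a simple dictionary, which removes stop words but does not stem.
I suggest a separate schema for text search objects, but that is entirely optional: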
CREATE SCHEMA ts;
GRANT USAGE ON SCHEMA ts TO public;
COMMENT ON SCHEMA ts IS 'text search objects';
CREATE TEXT SEARCH DICTIONARY ts.english_simple_dict (
    TEMPLATE = pg_catalog.simple
  , STOPWORDS = english
);
CREATE TEXT SEARCH CONFIGURATION ts.english_simple (COPY = simple);
ALTER TEXT SEARCH CONFIGURATION ts.english_simple
   ALTER MAPPING FOR asciiword WITH ts.english_simple_dict;  -- 1, 'Word, all ASCII'
Then your query works, and it is very fast, too:
SELECT *
FROM ts_stat($$SELECT to_tsvector('ts.english_simple', title) FROM item$$)
ORDER BY ndoc DESC
LIMIT 50;
dbfiddle here
This operates on lower-cased words, does no stemming, and does not break on non-ASCII letters.
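As a quick sanity check, assuming the `ts.english_simple` configuration from above has been created, the two configurations can be compared on a sample string (the sentence is made up for illustration). The built-in `english` configuration should reduce words to stems, while `ts.english_simple` should keep whole lower-cased words. Both drop English stop words.

```sql
-- Compare stemming vs. no stemming on the same input.
-- Assumes ts.english_simple exists as defined above.
SELECT to_tsvector('english',           'The foxes were jumping over lazy dogs') AS stemmed
     , to_tsvector('ts.english_simple', 'The foxes were jumping over lazy dogs') AS unstemmed;
```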
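What exactly counts as a "word" is a tricky matter. The default text search parser (currently the only one) recognizes 23 different types of tokens. See: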
SELECT * FROM ts_token_type('default');
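To see how the parser classifies a particular piece of text, `ts_debug()` can help. A minimal sketch using the built-in `english` configuration, with an invented sample string:

```sql
-- One row per token: its type alias, the raw token,
-- and the lexemes the mapped dictionaries produce (NULL if none).
SELECT alias, description, token, lexemes
FROM   ts_debug('english', 'Don''t miss the café at 9am');
```

Each row carries the token type, so the output shows directly whether something like "Don't" is treated as one word or several, and which type a non-ASCII word such as "café" falls under.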
Built-in text search configurations map most of those to (built-in) dictionaries. The mapping for the english configuration:
SELECT tt.*, m.mapdict::regdictionary AS dictionary
FROM pg_ts_config_map m
LEFT JOIN ts_token_type('default') tt ON tt.tokid = m.maptokentype  -- parser by name instead of its OID 3722
WHERE mapcfg = 'english'::regconfig -- 'ts.english_simple'::regconfig
ORDER BY tt.tokid;
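The demo above creates a new configuration based on the simple configuration. Since all English stop words are of type 'asciiword', we only need to remap this one type to remove stop words, without stemming or anything else.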
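If numbers, URLs and similar tokens should not count as words at all, one option is to drop their mappings from the new configuration, so that tokens of those types are discarded. This is only a sketch: the list of token types is an assumption about what you might want to exclude, so adjust it after looking at the output of `ts_token_type('default')`.

```sql
-- Tokens of unmapped types produce no lexemes and so never show up in ts_stat().
-- The chosen types are an illustrative assumption, not part of the original setup.
ALTER TEXT SEARCH CONFIGURATION ts.english_simple
   DROP MAPPING IF EXISTS FOR int, uint, float, sfloat, version, email, url, url_path, host, file;
```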
This will give you the expected output:
-- Some example data
WITH titles(title) AS
(
    VALUES
        ('This is a title'),
        ('This is another title'),
        ('This is finally a third title'),
        ('and I don''t like Mondays')
)
-- List here all the words that you consider 'stop words'
-- in lowercase
, stop_words(word) AS
(
    VALUES ('the'), ('a'), ('and')
)
-- Make list of (lowercased) found words
, found_lower_words AS
(
    SELECT
        lower(unnest(string_to_array(title, ' '))) AS word
    FROM
        titles
)
-- And now anti-join with the stop_words, group and count
SELECT
    word, count(*) AS word_count
FROM
    found_lower_words
    LEFT JOIN stop_words USING(word)
WHERE
    stop_words.word is NULL
GROUP BY
    word
ORDER BY
    word_count DESC, word ASC
LIMIT
    50 ;
The result will be:
| word    | word_count |
|---------+------------|
| is      |          3 |
| this    |          3 |
| title   |          3 |
| another |          1 |
| don't   |          1 |
| finally |          1 |
| i       |          1 |
| like    |          1 |
| mondays |          1 |
| third   |          1 |
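The demo above uses inline example data and a stop word column named `word`. Against the tables named in the question, the same anti-join idea could look roughly like the sketch below. It assumes `item(title)` and `stop_words(lowercase_stopword)` exist as described there. The filter on empty strings is an addition to cope with consecutive spaces.

```sql
-- Split each title on spaces, lower-case the pieces,
-- exclude anything listed in stop_words, then count.
SELECT w.word, count(*) AS word_count
FROM   item i
CROSS  JOIN LATERAL unnest(string_to_array(lower(i.title), ' ')) AS w(word)
WHERE  w.word <> ''                       -- skip empty strings from repeated spaces
AND    NOT EXISTS (
          SELECT 1
          FROM   stop_words s
          WHERE  s.lowercase_stopword = w.word
       )
GROUP  BY w.word
ORDER  BY word_count DESC, w.word
LIMIT  50;
```

Note that splitting on a single space leaves punctuation attached to words ("title," and "title" are counted separately), which the text search parser approach avoids.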