Can PostgreSQL's to_tsvector function return tokens/words instead of lexemes?

Question

tur*_*nip 4 postgresql nlp lemmatization

PostgreSQL's to_tsvector function is very useful, but for my data set it does a bit more than I want.

For example:

select * 
from to_tsvector('english', 'This is my favourite game. I enjoy everything about it.');

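yields: 'enjoy':7 'everyth':8 'favourit':4 'game':5

I'm not bothered that the stop words are filtered out; that's fine. But some words get completely mangled, such as everything and favourite.

Is there a way to change this behaviour, or is there a different function that does what I want?

PS: Yes, I can write my own query to do this (and I already have), but I'd like a faster way.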

Answer 1

Mad*_*ist 5

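The behaviour you are seeing and don't want is "stemming". If you don't want it, you have to pass a different text search configuration to to_tsvector. The "simple" configuration, which is backed by the simple dictionary, does no stemming, so it should fit your use case.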

select * 
from to_tsvector('simple', 'This is my favourite game. I enjoy everything about it.');

produces the following output:

'about':9 'enjoy':7 'everything':8 'favourite':4 'game':5 'i':6 'is':2 'it':10 'my':3 'this':1
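
To see the difference at the dictionary level, the built-in ts_lexize function can be called on individual words. This is only a small sketch: it assumes a stock PostgreSQL installation where the english_stem and simple dictionaries are available under their default names.

```sql
-- The Snowball stemmer used by the 'english' configuration truncates the words.
SELECT ts_lexize('english_stem', 'everything');   -- {everyth}
SELECT ts_lexize('english_stem', 'favourite');    -- {favourit}

-- The simple dictionary only lowercases the word and returns it unchanged.
SELECT ts_lexize('simple', 'Everything');         -- {everything}
SELECT ts_lexize('simple', 'favourite');          -- {favourite}
```

The stemmed results match the lexemes shown in the question, which suggests that the stemming dictionary, not the parser, is what shortens the words.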

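If you still want stop words removed, then as far as I know you have to define your own dictionary. See the example below, but you may want to read the documentation to make sure it does exactly what you need.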

CREATE TEXT SEARCH DICTIONARY only_stop_words (
    Template = pg_catalog.simple,
    Stopwords = english
);
CREATE TEXT SEARCH CONFIGURATION public.only_stop_words ( COPY = pg_catalog.simple );
ALTER TEXT SEARCH CONFIGURATION public.only_stop_words ALTER MAPPING FOR asciiword WITH only_stop_words;
select * 
from to_tsvector('only_stop_words', 'The This is my favourite game. I enjoy everything about it.');

'enjoy':8 'everything':9 'favourite':5 'game':6
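
One thing to watch when searching against vectors built this way: the query should probably be parsed with the same configuration, otherwise the query terms get stemmed while the stored lexemes do not. The sketch below assumes the public.only_stop_words configuration from the statements above has already been created; the column aliases are made up for illustration.

```sql
-- Query parsed with the same configuration: 'everything' stays unstemmed and matches.
SELECT to_tsvector('public.only_stop_words', 'I enjoy everything about it.')
       @@ to_tsquery('public.only_stop_words', 'everything') AS same_config;    -- true

-- Query parsed with 'english': the term becomes 'everyth' and no longer matches.
SELECT to_tsvector('public.only_stop_words', 'I enjoy everything about it.')
       @@ to_tsquery('english', 'everything') AS mixed_config;                  -- false

-- Inspect which dictionary handles each token and which lexemes it emits.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('public.only_stop_words', 'my favourite game');
```

Note that the ALTER MAPPING statement only remaps the asciiword token type. Other token types, such as words containing non-ASCII letters, keep the mapping copied from pg_catalog.simple, so ts_debug is a convenient way to check whether that matters for your data.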
