PostgreSQL full-text search tokenizer

Tom*_*mmi 6 postgresql full-text-search tokenize

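I just ran into a problem. I am trying to set up full-text search on localized content (Russian in particular). The problem is that the default configuration (as well as my custom one) does not handle letter case. Example: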

SELECT * from to_tsvector('test_russian', '?? ????? ????????? ????? ???????? ?????????');
> '??':1 '?????':4 '?????????':6 '?????????':3 '????????':5 '?????':2

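'На' is a stop word and should be removed, but it is not even lowercased in the resulting vector. If I pass the string in lowercase, everything works: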

SELECT * from to_tsvector('test_russian', '?? ????? ????????? ????? ???????? ?????????');
> '?????':4 '?????????':6 '?????????':3 '????????':5 '?????':2

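Of course, I could pass pre-lowercased strings, but the manual says:

The simple dictionary template operates by converting the input token to lower case and checking it against a file of stop words.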

The test_russian configuration looks like this:

create text search CONFIGURATION test_russian (COPY = 'russian');

CREATE TEXT SEARCH DICTIONARY russian_simple (
    TEMPLATE = pg_catalog.simple,
    STOPWORDS = russian
);

CREATE TEXT SEARCH DICTIONARY russian_snowball (
    TEMPLATE = snowball,
    Language = russian,
    StopWords = russian
);

alter text search configuration test_russian 
    alter mapping for word
    with russian_simple,russian_snowball;

But in fact I get exactly the same result with the built-in russian configuration.

I tried ts_debug, and the tokens are classified as word, as I expected.
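For reference, a check of this kind could look like the sketch below. It assumes the test_russian configuration and the russian_simple dictionary defined above, and uses the single token 'На' from the example; the column aliases are only illustrative.

```sql
-- Show how the parser classifies the token and which dictionary handled it.
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('test_russian', 'На');

-- Ask the simple dictionary directly, bypassing the parser and the mapping.
-- An empty array ({}) means the token was recognized as a stop word;
-- a one-element array holds the form the dictionary normalized it to.
SELECT ts_lexize('russian_simple', 'На') AS capitalized,
       ts_lexize('russian_simple', 'на') AS lowercase;
```

If the two ts_lexize results differ, the dictionary itself is failing to fold case, which points away from the parser and the mapping.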

Any ideas?

Tom*_*mmi 4

Problem solved. The reason was that the database had been initialized with the default ("C") CType and Collate. We used

initdb --locale=UTF-8 --lc-collate=UTF-8 --encoding=UTF-8 -U pgsql *PGSQL DATA DIR* 

to re-create the instance and

CREATE DATABASE "scratch"
  WITH OWNER "postgres"
  ENCODING 'UTF8'
  LC_COLLATE = 'ru_RU.UTF-8'
  LC_CTYPE = 'ru_RU.UTF-8';

to re-create the database, and the simple dictionary works now.
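To see whether a given database is affected, its locale settings can be inspected from the system catalog. This is a minimal sketch, assuming it is run while connected to the database in question and that the test_russian configuration from the question exists there; the column aliases are only illustrative.

```sql
-- Locale settings of the current database. A datctype of 'C' or 'POSIX'
-- would match the cause described above.
SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = current_database();

-- Quick sanity check: with a suitable LC_CTYPE, lower() is expected to
-- fold the Cyrillic capital letter as well.
SELECT lower('На') AS lowered;

-- Re-run the original test on the single stop word from the question.
SELECT to_tsvector('test_russian', 'На') AS vector;
```

Note that CREATE DATABASE generally accepts LC_COLLATE and LC_CTYPE values that differ from the template database only when TEMPLATE template0 is specified, so that clause may be needed depending on how the cluster was initialized.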