Where can I find Ukrainian 'ispell', 'aspell', or 'snowball' dictionaries to add to Postgres full-text search?

Sas*_* B. 7 postgresql dictionary full-text-search snowball ispell

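After parsing many documents, I have a lot of rows/columns containing Ukrainian text that should be indexed for full-text search in Postgres.

I found that Postgres 14 supports 29 languages out of the box, but unfortunately Ukrainian is not among them.

After further digging, I found that it allows external dictionaries to be added: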

CREATE TEXT SEARCH DICTIONARY my_lang_ispell (
    TEMPLATE = ispell,
    DictFile = path_to_my_lang_dict_file,
    AffFile = path_to_my_lang_affixes_file,
    StopWords = path_to_my_lang_stop_words_file
);

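But how do I find the most relevant DictFile, AffFile, and StopWords files? For example, the snowball sources do not include this language.

So, can anyone help me find the best way to get ispell, aspell, snowball, or other dictionaries for Ukrainian?

Thanks!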

Sas*_* B. 9

After digging deeper, I found a solution in the dict_uk resource.

1. Compile the files manually by following this guide:

$ sudo snap install gradle

$ cd dict_uk
$ ./gradlew expand

$ cd distr/hunspell/

$ ../../gradlew hunspell

$ sudo cp build/hunspell/uk_UA.aff /usr/share/postgresql/12/tsearch_data/uk_ua.affix
$ sudo cp build/hunspell/uk_UA.dic /usr/share/postgresql/12/tsearch_data/uk_ua.dict
$ sudo cp ../postgresql/ukrainian.stop /usr/share/postgresql/12/tsearch_data/ukrainian.stop

Or simply download and unzip the latest hunspell-uk_UA_X.XX zip and the stop-words file.

2. Follow the guide for setting up the ukrainian language in Postgres:

$ sudo cp uk_UA.aff $(pg_config --sharedir)/tsearch_data/uk_ua.affix
$ sudo cp uk_UA.dic $(pg_config --sharedir)/tsearch_data/uk_ua.dict
$ sudo cp ukrainian.stop $(pg_config --sharedir)/tsearch_data/ukrainian.stop
$ sudo su postgres
$ psql

CREATE TEXT SEARCH DICTIONARY ukrainian_huns (TEMPLATE = ispell, DictFile = uk_ua, AffFile = uk_ua, StopWords = ukrainian);

CREATE TEXT SEARCH DICTIONARY ukrainian_stem (template = simple, stopwords = ukrainian);

CREATE TEXT SEARCH CONFIGURATION ukrainian (PARSER=default);

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR hword, hword_part, word WITH ukrainian_huns, ukrainian_stem;

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR int, uint, numhword, numword, hword_numpart, email, float, file, url, url_path, version, host, sfloat WITH simple;

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR asciihword, asciiword, hword_asciipart WITH english_stem;

# \dFd
...
 pg_catalog | english_stem    | snowball stemmer for english language
...
 public     | ukrainian_huns  | 
 public     | ukrainian_stem  | 
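Before wiring the configuration into a table, it can help to check what the dictionary returns for individual words. The sketch below assumes the ukrainian_huns dictionary and the ukrainian configuration created above; ts_lexize and ts_debug are standard Postgres functions, and the sample words come from the example sentence used in this answer. The exact lexemes depend on the dictionary version, so no output is shown.

```sql
-- Ask the hunspell-based dictionary directly for the lexeme(s) of one word.
-- NULL means the word is unknown to this dictionary; an empty array means
-- it was recognised as a stop word.
SELECT ts_lexize('ukrainian_huns', 'дзюрчить');

-- Show, token by token, which dictionaries the 'ukrainian' configuration
-- consulted and which lexemes it produced.
SELECT alias, token, dictionaries, lexemes
FROM ts_debug('ukrainian', 'солодко дзюрчить джерело');
```

If ts_lexize returns NULL for common Ukrainian words, the .dict and .affix files are probably not the ones Postgres is reading.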

A searchable column can now be created with to_tsvector:

ALTER TABLE extracted_pages
ADD COLUMN tsvector_uk tsvector GENERATED ALWAYS AS (
  setweight(to_tsvector('ukrainian', coalesce(column_with_text, '')), 'A')
) STORED;
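The generated tsvector_uk column is usually paired with an index and queried with the @@ operator. The following is a minimal sketch rather than part of the original setup: the index name is hypothetical, the search word is only an illustration, and the table and column names are the ones from the ALTER TABLE statement above.

```sql
-- Hypothetical index name; GIN is the index type commonly used for tsvector columns.
CREATE INDEX extracted_pages_tsvector_uk_idx
    ON extracted_pages USING GIN (tsvector_uk);

-- Parse the search text with the same 'ukrainian' configuration,
-- keep only matching rows, and order them by relevance.
SELECT column_with_text,
       ts_rank(tsvector_uk, query) AS rank
FROM extracted_pages,
     plainto_tsquery('ukrainian', 'джерело') AS query
WHERE tsvector_uk @@ query
ORDER BY rank DESC
LIMIT 10;
```

Using the same configuration on both sides matters: the query text has to be reduced to the same lexemes that were stored in the column.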

This example shows correct stemming for Ukrainian:

SELECT to_tsvector('ukrainian', 'солодко дзюрчить джерело і хочеться жити, любити, творити... ');
 => [{"to_tsvector"=>"'джерело':3 'дзюрчати':2 'жити':6 'любити':7 'солодко':1 'творити':8 'хочеться':5"}]

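Results

In terms of quality, Postgres full-text search is on par with the comparable text search engine SphinxSearch, but it is somewhat slower.

For the same query over a large number of records (278_000), both return the same results: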

Postgres     - ActiveRecord: 67.6ms
SphinxSearch - ActiveRecord: 10.9ms

OS: Ubuntu 20.04
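The timings above do not say whether the Postgres query used an index. One way to check, sketched here with an illustrative search word and the table and column names used earlier in this answer, is to look at the query plan:

```sql
-- Run the query and report the actual plan, timing and buffer usage.
-- A sequential scan over extracted_pages suggests that no usable index
-- on tsvector_uk exists; a bitmap scan on a GIN index suggests one is used.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM extracted_pages
WHERE tsvector_uk @@ plainto_tsquery('ukrainian', 'джерело');
```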

Many thanks to the dict_uk support team!