Sas*_* B. 7 postgresql dictionary full-text-search snowball ispell
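After parsing many documents, I have a lot of rows/columns containing Ukrainian text that should be indexed for full-text search in Postgres.
I found that Postgres 14 supports 29 languages out of the box, but unfortunately Ukrainian is not among them.
After some more digging, I found that it allows adding external dictionaries: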
CREATE TEXT SEARCH DICTIONARY my_lang_ispell (
TEMPLATE = ispell,
DictFile = path_to_my_lang_dict_file,
AffFile = path_to_my_lang_affixes_file,
StopWords = path_to_my_lang_stop_words_file
);
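But how do I find the most relevant DictFile, AffFile, and StopWords files? The snowball sources, for example, do not include this language.
So, can anyone help me find the best way to get ispell, aspell, snowball, or other dictionaries for Ukrainian?
Thanks!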
After digging deeper, I found a solution in the dict_uk project. Build the hunspell dictionary from its sources (this assumes the dict_uk repository is already checked out; adjust the PostgreSQL version in the paths to match your installation):

sudo snap install gradle

$ cd dict_uk
$ ./gradlew expand

$ cd distr/hunspell/

$ ../../gradlew hunspell

$ sudo cp build/hunspell/uk_UA.aff /usr/share/postgresql/12/tsearch_data/uk_ua.affix
$ sudo cp build/hunspell/uk_UA.dic /usr/share/postgresql/12/tsearch_data/uk_ua.dict
$ sudo cp ../postgresql/ukrainian.stop /usr/share/postgresql/12/tsearch_data/ukrainian.stop

Alternatively, just download and unzip the latest hunspell-uk_UA_X.XX zip archive and the ukrainian.stop stop-words file, then copy them into place:

$ sudo cp uk_UA.aff $(pg_config --sharedir)/tsearch_data/uk_ua.affix
$ sudo cp uk_UA.dic $(pg_config --sharedir)/tsearch_data/uk_ua.dict
$ sudo cp ukrainian.stop $(pg_config --sharedir)/tsearch_data/ukrainian.stop

Then register the dictionaries and the text search configuration in psql:

$ sudo su postgres
$ psql

CREATE TEXT SEARCH DICTIONARY ukrainian_huns (TEMPLATE = ispell, DictFile = uk_ua, AffFile = uk_ua, StopWords = ukrainian);

CREATE TEXT SEARCH DICTIONARY ukrainian_stem (template = simple, stopwords = ukrainian);

CREATE TEXT SEARCH CONFIGURATION ukrainian (PARSER=default);

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR hword, hword_part, word WITH ukrainian_huns, ukrainian_stem;

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR int, uint, numhword, numword, hword_numpart, email, float, file, url, url_path, version, host, sfloat WITH simple;

ALTER TEXT SEARCH CONFIGURATION ukrainian ALTER MAPPING FOR asciihword, asciiword, hword_asciipart WITH english_stem;

# \dFd
...
 pg_catalog | english_stem | snowball stemmer for english language
...
 public | ukrainian_huns | 
 public | ukrainian_stem | 

Now you can create a searchable column with the help of to_tsvector:
ALTER TABLE extracted_pages
  ADD COLUMN tsvector_uk tsvector GENERATED ALWAYS AS (
    setweight(to_tsvector('ukrainian', coalesce(column_with_text, '')), 'A')
  ) STORED;

This example shows correct stemming for Ukrainian:
SELECT to_tsvector('ukrainian', 'солодко дзюрчить джерело і хочеться жити, любити, творити... ');
 => [{"to_tsvector"=>"'джерело':3 'дзюрчати':2 'жити':6 'любити':7 'солодко':1 'творити':8 'хочеться':5"}]

In terms of result quality, Postgres full-text search is on par with the comparable search engine SphinxSearch, but it is slower (roughly 6x in the measurement below).

For the same query over a large number of records (278_000), both return the same results:

Postgres - ActiveRecord: 67.6ms
SphinxSearch - ActiveRecord: 10.9ms

OS: Ubuntu 20.04

Many thanks to the dict_uk support team!
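
For completeness, here is a minimal sketch of how the generated tsvector_uk column might be indexed and queried. It assumes the ukrainian configuration and the extracted_pages table defined above; the index name, the id column, and the search term are hypothetical and only serve as illustration. The timings above were not measured with this exact query.

```sql
-- Hypothetical index name; a GIN index lets @@ lookups avoid a sequential scan.
CREATE INDEX extracted_pages_tsvector_uk_idx
    ON extracted_pages USING GIN (tsvector_uk);

-- Search with the same 'ukrainian' configuration that built the column,
-- so the query terms are normalized by the same dictionaries.
-- The id column is an assumption about the table layout.
SELECT id,
       ts_rank(tsvector_uk, query) AS rank
FROM extracted_pages,
     plainto_tsquery('ukrainian', 'джерело') AS query
WHERE tsvector_uk @@ query
ORDER BY rank DESC
LIMIT 20;
```

Running the SELECT under EXPLAIN ANALYZE should show whether the planner actually uses the index on your data.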