Map international characters to multiple options

Question

Sha*_*nas 9 elasticsearch

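What I want to achieve is the ability for people to search for individuals without being language-aware, but without penalizing those who are. What I mean is: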

Given that I build an index with:

1. Jorgensen
2. Jörgensen
3. Jørgensen

I want to be able to allow conversions like these:

ö to o
ö to oe
ø to o
ø to oe

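So if someone searches for: QUERY | RESULT (I'm including only the IDs, but in reality it would be the full records)

Jorgensen returns - 1,2,3
Jörgensen returns - 1,2
Jørgensen returns - 1,3
Joergensen returns - 2,3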

Starting from that, I tried to create an index analyzer and a char filter:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "ö => o",
            "ö => oe"
          ]
        }
      }
    }
  }
}

But this is invalid, because it tries to map the same character twice.

What am I missing? Do I need multiple analyzers? Any direction would be appreciated.

Answer 1

xec*_*cgr 1

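Since a custom mapping is not enough in your case, as the comments above show, let's work on your data and on character normalization.

In your case, normalization with unidecode is not enough either, because of the ø and oe conversions. Example: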

import unicodedata

def strip_accents(s):
    # Decompose each character (NFD) and drop the combining marks (category Mn).
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )

body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
for b in body_matches:
    print(b, strip_accents(b))

>>>> Jorgensen Jorgensen
>>>> Jörgensen Jorgensen
>>>> Jørgensen Jørgensen
>>>> Joergensen Joergensen

So we need a custom translation. For now I have only set up the characters you showed, but feel free to complete the list.

accented_letters = {
    u'ö': [u'o', u'oe'],
    u'ø': [u'o', u'oe'],
}
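The answer does not show how the table is applied, so the following is only a minimal sketch of one way to do it. The helper name `expand_variants` is hypothetical (it is not from the answer), and it assumes that every listed option should be generated for every accented letter, giving one spelling per combination.

```python
import unicodedata
from itertools import product

# Same table as above: each accented letter maps to its ASCII options.
accented_letters = {
    "ö": ["o", "oe"],
    "ø": ["o", "oe"],
}

def expand_variants(word):
    """Return every ASCII spelling of `word` (hypothetical helper)."""
    # Compose first, so that "o" + combining diaeresis is seen as a single "ö".
    word = unicodedata.normalize("NFC", word)
    # For each character, list its replacement options, or the character itself.
    options = [accented_letters.get(ch, [ch]) for ch in word]
    # The Cartesian product picks one option per position.
    return sorted({"".join(parts) for parts in product(*options)})

for name in ["Jorgensen", "Jörgensen", "Jørgensen", "Joergensen"]:
    print(name, expand_variants(name))
```

With this table only lowercase ö and ø are handled; uppercase forms and other letters would need their own entries.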

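Then we can normalize the words, store them in a dedicated attribute (body_normalized, for example), and index that attribute as a field of the Elasticsearch record.

Once the records are inserted, you can perform two types of search: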

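Exact search: the user input is not normalized, and the Elasticsearch query runs against the body field, which is not normalized either.
Similar search: the user input is normalized, and the query runs against the body_normalized field.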
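To make the two modes concrete without a cluster, here is a rough in-memory sketch in which a plain Python list stands in for the index. The names `variants`, `exact_search` and `similar_search` are hypothetical, and storing every spelling variant in body_normalized is an assumption about how the normalization is applied, not something the answer states.

```python
from itertools import product

ACCENTED = {"ö": ["o", "oe"], "ø": ["o", "oe"]}

def variants(word):
    """All ASCII spellings of `word` under the ACCENTED table (hypothetical helper)."""
    options = [ACCENTED.get(ch, [ch]) for ch in word]
    return {"".join(parts) for parts in product(*options)}

# Each record keeps the raw value and its normalized spellings side by side.
records = [
    {"id": i, "body": name, "body_normalized": sorted(variants(name))}
    for i, name in enumerate(["Jorgensen", "Jörgensen", "Jørgensen"], start=1)
]

def exact_search(query):
    # Raw input compared against the raw field.
    return [r["id"] for r in records if r["body"] == query]

def similar_search(query):
    # Normalized input compared against the normalized field.
    wanted = variants(query)
    return [r["id"] for r in records if wanted & set(r["body_normalized"])]

print(exact_search("Jörgensen"))      # [2]
print(similar_search("Jorgensen"))    # [1, 2, 3]
print(similar_search("Joergensen"))   # [2, 3]
print(similar_search("Jörgensen"))    # [1, 2, 3]
```

Note that under this assumption a similar search for Jörgensen returns all three records, which is broader than the 1,2 result asked for in the question; combining it with the exact search is one way to rank the closer matches first.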

Let's see an example:

# es is an Elasticsearch client created elsewhere; normalize_word is the
# normalization helper, which is not shown in this answer.
body_matches = [
    u'Jorgensen',
    u'Jörgensen',
    u'Jørgensen',
    u'Joergensen',
]
print("------EXACT MATCH------")
for body_match in body_matches:
    # Raw user input against the raw body field.
    elasticsearch_query = {
        "query": {
            "match": {
                "body": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }

    res = es.search(**es_kwargs)
    print(body_match, " MATCHING BODIES=", res['hits']['total'])

    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

print("\n------SIMILAR MATCHES------")
for body_match in body_matches:
    # Normalized user input against the body_normalized field.
    body_match = normalize_word(body_match)
    elasticsearch_query = {
        "query": {
            "match": {
                "body_normalized": body_match
            }
        }
    }
    es_kwargs = {
        "doc_type": "your_type",
        "index": 'your_index',
        "body": elasticsearch_query
    }

    res = es.search(**es_kwargs)
    print(body_match, " MATCHING NORMALIZED BODIES=", res['hits']['total'])

    for r in res['hits']['hits']:
        print("-", r['_source'].get('body', ''))

You can see a running example in this notebook.

Archived:	9 years ago
Views:	187
Last activity:	8 years, 11 months ago