Elasticsearch multi-word synonyms with keyword-tokenized analysis

Jef*_*eff 8 synonym elasticsearch

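I'm trying to get keyword-tokenized multi-word synonyms out of the _analyze API. The API returns the expected result for single-word synonyms, but not for multi-word ones. Here are my settings and analysis chain: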

curl -XPOST "http://localhost:9200/test" -d'
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_syn_filt": {
            "type": "synonym",
            "synonyms": [
              "foo bar, fooo bar", 
              "bazzz, baz"
            ]
          }
        },
        "analyzer": {
          "my_synonyms": {
            "filter": [
              "lowercase",
              "my_syn_filt"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
}'

Now test it with the _analyze API:

curl 'localhost:9200/test/_analyze?analyzer=my_synonyms&text=baz'

That call returns what I expect (the same result comes back for 'bazzz'):

{
  "tokens": [
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 3,
      "start_offset": 0,
      "token": "bazzz"
    },
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 3,
      "start_offset": 0,
      "token": "baz"
    }
  ]
}

Now, when I make the same call with the multi-word synonym text, the API returns only a single token of type 'word', with no synonyms:

curl 'localhost:9200/test/_analyze?analyzer=my_synonyms&text=foo+bar'

(returns)

{
  "tokens": [
    {
      "position": 1,
      "type": "word",
      "end_offset": 7,
      "start_offset": 0,
      "token": "foo bar"
    }
  ]
}

Why doesn't the analyze API return "foo bar" and "fooo bar" tokens of type SYNONYM?

Jef*_*eff 13

The "tokenizer": "keyword" key/value pair also needs to be added to the my_syn_filt filter declaration, like so:

curl -XPOST "http://localhost:9200/test" -d'
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_syn_filt": {
            "tokenizer": "keyword",
            "type": "synonym",
            "synonyms": [
              "foo bar, fooo bar", 
              "bazzz, baz"
            ]
          }
        },
        "analyzer": {
          "my_synonyms": {
            "filter": [
              "lowercase",
              "my_syn_filt"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  }
}'

With the index settings above, the _analyze API returns the desired SYNONYM tokens for 'foo bar':

{
  "tokens": [
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 7,
      "start_offset": 0,
      "token": "foo bar"
    },
    {
      "position": 1,
      "type": "SYNONYM",
      "end_offset": 7,
      "start_offset": 0,
      "token": "fooo bar"
    }
  ]
}
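
As for why this helps: the synonym filter's "tokenizer" setting controls how the synonym rules themselves are split into tokens, and it defaults to whitespace. With the default, the rule "foo bar" is stored as the two-token sequence foo, bar, which can never match the single token "foo bar" that the keyword tokenizer emits. Below is a toy Python model of that behaviour. It is only a sketch and not Elasticsearch code: every function name in it is made up for illustration, and it matches the whole token stream at once rather than scanning it the way a real filter would.

```python
# Toy model of a synonym filter. Hypothetical names, NOT the Elasticsearch implementation.

def whitespace_tokenizer(text):
    """Split on whitespace, like the synonym filter's default rule tokenizer."""
    return text.split()

def keyword_tokenizer(text):
    """Emit the whole input as a single token."""
    return [text]

def build_synonym_map(rules, rule_tokenizer):
    """Map each rule entry (as a tuple of tokens) to every entry in its group."""
    synonym_map = {}
    for rule in rules:
        entries = [tuple(rule_tokenizer(e.strip())) for e in rule.split(",")]
        for entry in entries:
            synonym_map[entry] = entries
    return synonym_map

def apply_synonyms(tokens, synonym_map):
    """Simplification: try to match the entire token stream against the map."""
    key = tuple(tokens)
    if key in synonym_map:
        return [(" ".join(e), "SYNONYM") for e in synonym_map[key]]
    return [(t, "word") for t in tokens]

RULES = ["foo bar, fooo bar", "bazzz, baz"]

def analyze(text, rule_tokenizer):
    """keyword tokenizer -> lowercase -> synonym filter, as in my_synonyms."""
    tokens = keyword_tokenizer(text.lower())
    return apply_synonyms(tokens, build_synonym_map(RULES, rule_tokenizer))

# Rules split on whitespace: the key ("foo", "bar") never equals ("foo bar",).
print(analyze("foo bar", whitespace_tokenizer))
# [('foo bar', 'word')]

# Rules kept whole by the keyword tokenizer: the single token matches.
print(analyze("foo bar", keyword_tokenizer))
# [('foo bar', 'SYNONYM'), ('fooo bar', 'SYNONYM')]

# Single-word rules tokenize the same either way, which is why 'baz' always worked.
print(analyze("baz", whitespace_tokenizer))
# [('bazzz', 'SYNONYM'), ('baz', 'SYNONYM')]
```

The point of the sketch is only that the rules and the incoming token stream have to be tokenized the same way for a multi-word entry to match.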

  • Why can a custom filter specify a tokenizer? Is there any documentation on this? (2 upvotes)