Different position increment behavior of the synonym filter

raf*_*ron · 5 · tags: lucene, search, elasticsearch

My use case is searching with synonym support over edge_ngrams, where the tokens to be matched should appear in order.

While experimenting with _analyze, I observed two different behaviors of the filter chain with respect to position increments.

  1. With the filter chain lowercase, synonym, there is no position increment by the SynonymFilter.
  2. With the filter chain lowercase, edge_ngram, synonym, there is a position increment by the SynonymFilter.

Below are the queries I ran for each case:

Case 1. No position increment

PUT synonym_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms": [
            "begin => start"
          ]
        }
      }
    }
  }
}


GET synonym_test/_analyze
{
  "text": "begin working",
  "analyzer": "by_smart"
}


Output:

{
  "tokens": [
    {
      "token": "start",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "working",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 1
    }
  ]
}

Case 2. Position increment

PUT synonym_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_edge_ngram",
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms": [
            "begin => start"
          ]
        },
        "custom_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": "2",
          "max_gram": "60"
        }
      }
    }
  }
}

GET synonym_test/_analyze
{
  "text": "begin working",
  "analyzer": "by_smart"
}

Output:

{
  "tokens": [
    {
      "token": "be",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "beg",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "begi",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "start",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "work",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "worki",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "workin",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "working",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    }
  ]
}

Note that in Case 1, when the token begin is replaced by start, the synonym keeps the same position (0) and there is no position increment. In Case 2, however, when the begin token is replaced by start, the synonym is emitted at position 1 and the positions of the subsequent tokens in the stream are incremented.


Now my questions are:

  1. Why does the position increment not happen in Case 1 but only in Case 2?
  2. The main problem this causes is that when the input query is begi wor, a match_phrase query (with the default slop of 0) does not match begin work. This happens because begi and wor end up 2 positions apart. Any suggestions on how to achieve this behavior without affecting my use case?
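For reference, this is roughly how I reproduce the failing phrase match (a minimal sketch; the document id, the doc type, and the title field mapped with the by_smart analyzer are assumed here for illustration):

```json
PUT synonym_test/doc/1
{
  "title": "begin work"
}

GET synonym_test/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "begi wor",
        "analyzer": "by_smart"
      }
    }
  }
}
```

With the Case 2 analyzer, begi is indexed at position 0 and wor at position 2 (the synonym start occupies position 1), so the phrase query with slop 0 finds no adjacent pair and returns no hits.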

I am using Elasticsearch version 5.6.8, which ships with Lucene version 6.6.1.

I have read several documentation links and articles, but I could not find anything that properly explains why this happens, or whether there is a setting that achieves the behavior I want.