Different position increment behavior of the synonym filter

raf*_*ron · 5 · tags: lucene, search, elasticsearch

My use case is searching with synonym support over edge_ngrams, where the tokens to be matched should appear in order.

While experimenting with _analyze, I observed two different behaviors of the filter chain with respect to position increments.

  1. With the filter chain lowercase, synonym, there is no position increment by the SynonymFilter.
  2. With the filter chain lowercase, edge_ngram, synonym, there is a position increment by the SynonymFilter.

Below are the queries I ran for each case:

Case 1. No position increment

PUT synonym_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms": [
            "begin => start"
          ]
        }
      }
    }
  }
}


GET synonym_test/_analyze
{
  "text": "begin working",
  "analyzer": "by_smart"
}


Output:

{
  "tokens": [
    {
      "token": "start",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "working",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 1
    }
  ]
}

Case 2. Position increment

PUT synonym_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_edge_ngram",
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms": [
            "begin => start"
          ]
        },
        "custom_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": "2",
          "max_gram": "60"
        }
      }
    }
  }
}

GET synonym_test/_analyze
{
  "text": "begin working",
  "analyzer": "by_smart"
}

Output:

{
  "tokens": [
    {
      "token": "be",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "beg",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "begi",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "start",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "work",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "worki",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "workin",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "working",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    }
  ]
}

Note that in Case 1, when the token begin is replaced by start, the synonym keeps the same position (0) and there is no position increment. In Case 2, however, when the begin token is replaced by start, the synonym is emitted at position 1 and the positions of the subsequent tokens in the stream are incremented.


Now my questions are:

  1. Why does the position increment not happen in Case 1 but only in Case 2?
  2. The main problem this causes is that when the input query is begi wor, a match_phrase query (with the default slop of 0) does not match begin work. This happens because begi and wor end up 2 positions apart. Any suggestions on how to achieve this behavior without affecting my use case?
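For reference, this is roughly how I reproduce the failing phrase match (a minimal sketch; the document id, the doc type, and the title field mapped with the by_smart analyzer are assumed here for illustration):

```json
PUT synonym_test/doc/1
{
  "title": "begin work"
}

GET synonym_test/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "begi wor",
        "analyzer": "by_smart"
      }
    }
  }
}
```

With the Case 2 analyzer, begi is indexed at position 0 and wor at position 2 (the synonym start occupies position 1), so the phrase query with slop 0 finds no adjacent pair and returns no hits.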

I am using Elasticsearch version 5.6.8, which ships with Lucene version 6.6.1.

I have read several documentation links and articles, but I could not find anything that properly explains why this happens, or whether there is a setting that achieves the behavior I want.