Why does Elasticsearch return the wrong relevance score?

Question

I am learning Elasticsearch, and I inserted the following data into the megacorp index under the employee type:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}
Then I ran the following request:

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "about": "rock climbing"
    }
  }
}
However, I got the following result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.6682933,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.6682933,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}
What puzzles me is that the relevance score of the following record:

{
  "_index": "megacorp",
  "_type": "employee",
  "_id": "1",
  "_score": 0.5753642,
  "_source": {
    "first_name": "John",
    "last_name": "Smith",
    "age": 25,
    "about": "I love to go rock climbing",
    "interests": [
      "sports",
      "music"
    ]
  }
}
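is lower than that of the previous record, even though it matches both query terms. I ran the query with explain: true and got the following result: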

{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "skipped": 0, "failed": 0 },
  "hits": {
    "total": 2,
    "max_score": 0.6682933,
    "hits": [
      {
        "_shard": "[megacorp][2]",
        "_node": "pGtCz_FvSTmteJwQKvn_lg",
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.6682933,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [ "music" ],
          "fielddata": true
        },
        "_explanation": {
          "value": 0.6682933,
          "description": "sum of:",
          "details": [
            {
              "value": 0.6682933,
              "description": "weight(about:rock in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 0.6682933,
                  "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details": [
                    {
                      "value": 0.6931472,
                      "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details": [
                        { "value": 1.0, "description": "docFreq", "details": [] },
                        { "value": 2.0, "description": "docCount", "details": [] }
                      ]
                    },
                    {
                      "value": 0.96414346,
                      "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details": [
                        { "value": 1.0, "description": "termFreq=1.0", "details": [] },
                        { "value": 1.2, "description": "parameter k1", "details": [] },
                        { "value": 0.75, "description": "parameter b", "details": [] },
                        { "value": 5.5, "description": "avgFieldLength", "details": [] },
                        { "value": 6.0, "description": "fieldLength", "details": [] }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": "[megacorp][3]",
        "_node": "pGtCz_FvSTmteJwQKvn_lg",
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [ "sports", "music" ],
          "fielddata": true
        },
        "_explanation": {
          "value": 0.5753642,
          "description": "sum of:",
          "details": [
            {
              "value": 0.2876821,
              "description": "weight(about:rock in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details": [
                    {
                      "value": 0.2876821,
                      "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details": [
                        { "value": 1.0, "description": "docFreq", "details": [] },
                        { "value": 1.0, "description": "docCount", "details": [] }
                      ]
                    },
                    {
                      "value": 1.0,
                      "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details": [
                        { "value": 1.0, "description": "termFreq=1.0", "details": [] },
                        { "value": 1.2, "description": "parameter k1", "details": [] },
                        { "value": 0.75, "description": "parameter b", "details": [] },
                        { "value": 6.0, "description": "avgFieldLength", "details": [] },
                        { "value": 6.0, "description": "fieldLength", "details": [] }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0.2876821,
              "description": "weight(about:climbing in 0) [PerFieldSimilarity], result of:",
              "details": [
                {
                  "value": 0.2876821,
                  "description": "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details": [
                    {
                      "value": 0.2876821,
                      "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details": [
                        { "value": 1.0, "description": "docFreq", "details": [] },
                        { "value": 1.0, "description": "docCount", "details": [] }
                      ]
                    },
                    {
                      "value": 1.0,
                      "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details": [
                        { "value": 1.0, "description": "termFreq=1.0", "details": [] },
                        { "value": 1.2, "description": "parameter k1", "details": [] },
                        { "value": 0.75, "description": "parameter b", "details": [] },
                        { "value": 6.0, "description": "avgFieldLength", "details": [] },
                        { "value": 6.0, "description": "fieldLength", "details": [] }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
Could you tell me the reason behind this?

Answer 1

Pio*_*ski 6

Short answer: relevance in Elasticsearch is not a simple topic :) Details below.

I tried to reproduce your case...

First I indexed two documents:

POST /megacorp/employee/1
{
  "first_name": "John",
  "last_name": "Smith",
  "age": 25,
  "about": "I love to go rock climbing",
  "interests": [
    "sports",
    "music"
  ]
}

POST /megacorp/employee/2
{
  "first_name": "Jane",
  "last_name": "Smith",
  "age": 32,
  "about": "I like to collect rock albums",
  "interests": [
    "music"
  ]
}

Then I ran your query:

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "about": "rock climbing"
    }
  }
}

My results were completely different:

{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}

As you can see, the results are in the "expected" order. Note that the _score values are completely different from yours.

The question is: why? What happened?

A detailed answer for this situation is given in the article Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch.

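In short: as you may know, Elasticsearch stores documents in shards. To be faster, it uses the query_then_fetch strategy by default. This means Elasticsearch first asks each shard for its results, then fetches those results and presents them to the user. Scores are calculated the same way: each shard scores its own documents using only its local term statistics (docCount, docFreq, avgFieldLength), not the statistics of the whole index.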
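As a rough check of the per-shard explanation, the sketch below plugs the numbers from the question's explain output into the two formulas printed there (idf and tfNorm). The function names are my own, and the inputs are copied from the explain output rather than read from a cluster.

```python
import math

K1 = 1.2   # "parameter k1" from the explain output
B = 0.75   # "parameter b" from the explain output


def idf(doc_count, doc_freq):
    # idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length):
    # tfNorm, computed as (freq * (k1 + 1)) /
    #   (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * field_length / avg_field_length)
    )


def term_score(doc_count, doc_freq, freq, field_length, avg_field_length):
    return idf(doc_count, doc_freq) * tf_norm(freq, field_length, avg_field_length)


# Shard [megacorp][2]: document 2 (Jane) matches only "rock".
# That shard reports docCount=2, docFreq=1, avgFieldLength=5.5.
jane = term_score(doc_count=2, doc_freq=1, freq=1.0,
                  field_length=6.0, avg_field_length=5.5)

# Shard [megacorp][3]: document 1 (John) matches "rock" and "climbing".
# That shard reports docCount=1, docFreq=1 for both terms, so each term
# contributes the same amount.
john = 2 * term_score(doc_count=1, doc_freq=1, freq=1.0,
                      field_length=6.0, avg_field_length=6.0)

print(f"document 2 (Jane): {jane:.7f}")
print(f"document 1 (John): {john:.7f}")
```

The two values agree with the reported scores 0.6682933 and 0.5753642 to within float rounding. The only inputs that differ between the shards are docCount and avgFieldLength, which is why a single match on one shard can outscore two matches on another.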

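As you can see, 5 shards were queried in our results. If you do not specify otherwise when creating the index (you can do so with the number_of_shards setting), Elasticsearch uses 5 shards by default (this was the default before version 7.0; later versions default to 1). That is why our scores differ. Moreover, if you try this again yourself, there is a good chance you will get different results again. Everything depends on how the documents are distributed across the shards. If you set number_of_shards to 1 for this index, you will get the same scores every time.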
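To see how much the shard layout matters, here is a small sketch that scores the two documents twice: once with both on a single shard, and once with each document alone on its own shard. It is a toy reimplementation of the BM25 formula from the explain output, not Elasticsearch itself; the `bm25` helper and the lowercase whitespace tokenisation are assumptions standing in for the real analyzer.

```python
import math

K1, B = 1.2, 0.75  # BM25 parameters shown in the explain output

# Tokenised "about" fields (assumption: lowercase + whitespace split).
docs = {
    "1": "I love to go rock climbing".lower().split(),
    "2": "I like to collect rock albums".lower().split(),
}


def bm25(query_terms, doc_id, shard_docs):
    """Score one document using only the statistics of the given shard."""
    doc_count = len(shard_docs)
    avg_len = sum(len(tokens) for tokens in shard_docs.values()) / doc_count
    tokens = shard_docs[doc_id]
    score = 0.0
    for term in query_terms:
        freq = tokens.count(term)
        if freq == 0:
            continue  # term does not occur in this document
        doc_freq = sum(1 for other in shard_docs.values() if term in other)
        idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
        tf_norm = (freq * (K1 + 1)) / (
            freq + K1 * (1 - B + B * len(tokens) / avg_len)
        )
        score += idf * tf_norm
    return score


query = ["rock", "climbing"]

# Both documents on one shard: statistics cover the whole index.
one_shard = {doc_id: bm25(query, doc_id, docs) for doc_id in docs}

# Each document alone on its own shard: statistics are purely local.
split = {doc_id: bm25(query, doc_id, {doc_id: docs[doc_id]}) for doc_id in docs}

print("one shard:             ", one_shard)
print("one document per shard:", split)
```

In the one-document-per-shard case the sketch gives roughly 0.5753642 and 0.2876821, the scores in my result above, which is consistent with each document having landed on its own shard (the response itself does not confirm this). With a single shard, "rock" occurs in every document and so carries little weight, while "climbing" is rare and pushes document 1 clearly ahead. Elasticsearch also offers search_type=dfs_query_then_fetch, which gathers term statistics from all shards before scoring, at the cost of an extra round trip.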

Another thing the article mentions is:

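People start loading just a few documents into their index and ask "why does document A have a higher/lower score than document B", and sometimes the answer is that the user has a relatively high ratio of shards to documents, so that the scores are skewed across different shards.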

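Elasticsearch is designed to handle large amounts of data, and the more data you put into an index, the more accurate the results become, because the term statistics on each shard get closer to those of the whole index.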
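The effect of index size can be illustrated with a small simulation. The sketch below places documents on 5 shards at random and measures how far apart the per-shard idf values for a single term end up. Everything here is an assumption made for illustration: the function name `idf_spread`, the 30% term rate, and the random placement (Elasticsearch actually routes by a hash of the document ID).

```python
import math
import random


def idf(doc_count, doc_freq):
    # Same idf formula as in the explain output.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_spread(num_docs, num_shards=5, term_rate=0.3, seed=0):
    """Return max minus min per-shard idf for one term.

    Each document goes to a random shard and contains the term with
    probability term_rate (hypothetical values, for illustration only).
    """
    rng = random.Random(seed)
    doc_count = [0] * num_shards
    doc_freq = [0] * num_shards
    for _ in range(num_docs):
        shard = rng.randrange(num_shards)
        doc_count[shard] += 1
        if rng.random() < term_rate:
            doc_freq[shard] += 1
    idfs = [idf(count, freq) for count, freq in zip(doc_count, doc_freq)]
    return max(idfs) - min(idfs)


for n in (2, 100, 100_000):
    print(f"{n:>7} documents: idf spread across shards = {idf_spread(n):.4f}")
```

With only a handful of documents the shards can disagree widely about how rare a term is, while with many documents the per-shard values should settle close to each other. The exact figures depend on the random draw, so treat the output as indicative only.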

I hope my answer clears up your doubts.
