Why does Elasticsearch return the wrong relevance score?

Ans*_*yay 1 elasticsearch elastic-stack

I am learning Elasticsearch, and I inserted the following data into the megacorp index under the type employee:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6931472,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6931472,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

Then I ran the following request:

GET /megacorp/employee/_search
{
    "query" : {
        "match" : {
            "about" : "rock climbing"
        }
    }
}

However, I got the following result:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6682933,
    "hits" : [
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6682933,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ]
        }
      },
      {
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }
    ]
  }
}

I am puzzled that the relevance score of the following record:

{
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ]
        }
      }

is lower than that of the previous one. I ran the query with

"explain": true

and got the following result:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.6682933,
    "hits" : [
      {
        "_shard" : "[megacorp][2]",
        "_node" : "pGtCz_FvSTmteJwQKvn_lg",
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "2",
        "_score" : 0.6682933,
        "_source" : {
          "first_name" : "Jane",
          "last_name" : "Smith",
          "age" : 32,
          "about" : "I like to collect rock albums",
          "interests" : [
            "music"
          ],
          "fielddata" : true
        },
        "_explanation" : {
          "value" : 0.6682933,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.6682933,
              "description" : "weight(about:rock in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.6682933,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.6931472,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 2.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 0.96414346,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 5.5,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[megacorp][3]",
        "_node" : "pGtCz_FvSTmteJwQKvn_lg",
        "_index" : "megacorp",
        "_type" : "employee",
        "_id" : "1",
        "_score" : 0.5753642,
        "_source" : {
          "first_name" : "John",
          "last_name" : "Smith",
          "age" : 25,
          "about" : "I love to go rock climbing",
          "interests" : [
            "sports",
            "music"
          ],
          "fielddata" : true
        },
        "_explanation" : {
          "value" : 0.5753642,
          "description" : "sum of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "weight(about:rock in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value" : 0.2876821,
              "description" : "weight(about:climbing in 0) [PerFieldSimilarity], result of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
                  "details" : [
                    {
                      "value" : 0.2876821,
                      "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "docFreq",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.0,
                          "description" : "docCount",
                          "details" : [ ]
                        }
                      ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                      "details" : [
                        {
                          "value" : 1.0,
                          "description" : "termFreq=1.0",
                          "details" : [ ]
                        },
                        {
                          "value" : 1.2,
                          "description" : "parameter k1",
                          "details" : [ ]
                        },
                        {
                          "value" : 0.75,
                          "description" : "parameter b",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "avgFieldLength",
                          "details" : [ ]
                        },
                        {
                          "value" : 6.0,
                          "description" : "fieldLength",
                          "details" : [ ]
                        }
                      ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}

Can you tell me the reason behind this?

Pio*_*ski 6

Short answer: relevance in Elasticsearch is not a simple topic :) Details below.

I tried to reproduce your situation...

First I indexed two documents:

POST /megacorp/employee/1
{
  "first_name": "John",
  "last_name": "Smith",
  "age": 25,
  "about": "I love to go rock climbing",
  "interests": [
    "sports",
    "music"
  ]
}

POST /megacorp/employee/2
{
  "first_name": "Jane",
  "last_name": "Smith",
  "age": 32,
  "about": "I like to collect rock albums",
  "interests": [
    "music"
  ]
}

Then I ran your query:

GET /megacorp/employee/_search
{
  "query": {
    "match": {
      "about": "rock climbing"
    }
  }
}

My results were completely different:

{
  "took": 89,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "first_name": "John",
          "last_name": "Smith",
          "age": 25,
          "about": "I love to go rock climbing",
          "interests": [
            "sports",
            "music"
          ]
        }
      },
      {
        "_index": "megacorp",
        "_type": "employee",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "first_name": "Jane",
          "last_name": "Smith",
          "age": 32,
          "about": "I like to collect rock albums",
          "interests": [
            "music"
          ]
        }
      }
    ]
  }
}

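As you can see, the results are in the "expected" order. Note that the _score values are completely different from yours.

The question is: why? What happened?

A detailed answer is given in the article Practical BM25 - Part 1: How Shards Affect Relevance Scoring in Elasticsearch.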

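In short: as you may have noticed, Elasticsearch stores documents in shards. To be faster, it uses the query_then_fetch strategy by default. This means Elasticsearch first runs the query on each shard, then fetches the matching documents and returns them to the user. Scores are computed the same way: each shard scores its own hits using only its local term statistics (the docCount, docFreq and avgFieldLength values in your explain output).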
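To make the per-shard scoring concrete, here is a minimal Python sketch that redoes the arithmetic from the explain output in the question. The formulas and every input value (k1, b, docCount, docFreq, fieldLength, avgFieldLength) are copied from that output; the function and variable names are mine, and this is plain double-precision arithmetic, not Lucene's actual code.

```python
import math

K1, B = 1.2, 0.75  # "parameter k1" and "parameter b" from the explain output

def idf(doc_count, doc_freq):
    # idf formula as quoted in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def tf_norm(freq, field_length, avg_field_length):
    # tfNorm formula as quoted in the explain output
    return (freq * (K1 + 1)) / (
        freq + K1 * (1 - B + B * field_length / avg_field_length)
    )

# Document 2 on shard [megacorp][2]: only "rock" matches.
# That shard reports docCount=2, docFreq=1, avgFieldLength=5.5.
score_doc2 = idf(2, 1) * tf_norm(1, 6, 5.5)

# Document 1 on shard [megacorp][3]: "rock" and "climbing" both match.
# That shard reports docCount=1, docFreq=1, avgFieldLength=6.0 for each term.
term_score = idf(1, 1) * tf_norm(1, 6, 6.0)
score_doc1 = term_score + term_score

print(f"doc 2: {score_doc2:.6f}  doc 1: {score_doc1:.6f}")
```

Both values agree with the reported _score values (0.6682933 and 0.5753642) to about six decimal places. The difference comes almost entirely from idf: on document 1's shard the term appears in the only document there, so idf is about 0.288, while on document 2's shard "rock" appears in one of two documents, so idf is about 0.693. One "rare" match on one shard therefore outweighs two "common" matches on the other.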

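As you can see, 5 shards were queried in both of our results. If you do not specify the number of shards when creating the index (the number_of_shards parameter), this version of Elasticsearch uses 5 by default (since 7.0 the default is 1). That is why our scores differ. Moreover, if you repeat the experiment yourself, there is a good chance you will get different results again. Everything depends on how the documents are distributed across the shards. If the index is created with number_of_shards set to 1, you will get the same scores every time.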
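The effect of document placement can be simulated without a cluster. The sketch below is a toy model under stated assumptions: tokenize is a lowercase whitespace split standing in for the real analyzer, bm25_scores is a hypothetical helper that scores one shard with that shard's own statistics using the formulas from the explain output, and Lucene's internal norm encoding is ignored.

```python
import math

K1, B = 1.2, 0.75  # defaults shown in the explain output

def tokenize(text):
    # Assumption: lowercase + whitespace split, not the real analyzer
    return text.lower().split()

def bm25_scores(shard_docs, query):
    """Score the documents of ONE shard using only that shard's statistics."""
    tokens = {doc_id: tokenize(text) for doc_id, text in shard_docs.items()}
    doc_count = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / doc_count
    scores = {}
    for doc_id, doc_tokens in tokens.items():
        score = 0.0
        for term in tokenize(query):
            freq = doc_tokens.count(term)
            if freq == 0:
                continue
            doc_freq = sum(1 for t in tokens.values() if term in t)
            idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
            norm = 1 - B + B * len(doc_tokens) / avg_len
            score += idf * (freq * (K1 + 1)) / (freq + K1 * norm)
        if score > 0:
            scores[doc_id] = score
    return scores

docs = {
    "1": "I love to go rock climbing",
    "2": "I like to collect rock albums",
}
query = "rock climbing"

# Case A: each document lands on its own shard
separate = {}
for doc_id, text in docs.items():
    separate.update(bm25_scores({doc_id: text}, query))

# Case B: both documents on a single shard (number_of_shards = 1)
one_shard = bm25_scores(docs, query)

print("separate shards:", separate)
print("single shard:   ", one_shard)
```

In case A the model yields the same scores as my reproduction above (about 0.5754 for document 1 and 0.2877 for document 2). In case B the statistics cover both documents, so "rock" is recognized as the more common term and document 1 leads by a much wider margin. The ordering happens to be the same in both cases here; the question shows a placement where it is not.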

Another point the article makes is:

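People start loading just a few documents into their index and ask "why does document A have a higher/lower score than document B", and sometimes the answer is that the user has a relatively high ratio of shards to documents, so that the scores are skewed across different shards.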

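Elasticsearch is designed to handle large amounts of data, and the more data you put into the index, the closer each shard's term statistics get to those of the whole index, so the more consistent the scores become.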
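As a rough illustration of that last point, the following sketch distributes synthetic documents over 5 shards and measures how far apart the per-shard idf values for a single term end up. Everything here is an assumption for illustration: the random shard choice stands in for Elasticsearch's hash-based routing, term_rate is an invented probability that a document contains the term, and idf_spread is a hypothetical helper.

```python
import math
import random

def idf(doc_count, doc_freq):
    # Same idf formula as in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def idf_spread(num_docs, num_shards=5, term_rate=0.3, seed=42):
    """Return max - min of the per-shard idf for one term."""
    rng = random.Random(seed)
    doc_count = [0] * num_shards
    doc_freq = [0] * num_shards
    for _ in range(num_docs):
        shard = rng.randrange(num_shards)  # stand-in for hash-based routing
        doc_count[shard] += 1
        if rng.random() < term_rate:       # this document contains the term
            doc_freq[shard] += 1
    # An empty shard still gets a number here; a real shard with no
    # documents would simply return no hits.
    values = [idf(n, df) for n, df in zip(doc_count, doc_freq)]
    return max(values) - min(values)

small = idf_spread(10)
large = idf_spread(50_000)
print(f"idf spread with 10 docs: {small:.3f}, with 50,000 docs: {large:.3f}")
```

With only a handful of documents the shards typically disagree strongly about how rare the term is, while with tens of thousands of documents the per-shard values nearly coincide, which is why the skew described in the article mostly affects small test indices.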

I hope my answer clears up your doubts.