Elasticsearch: Influencing scoring with a custom score field in the document

Hay*_*ire 11 elasticsearch

I have a set of words extracted from text by an NLP algorithm, along with a relevance score for each word in each document.

For example:

document 1: {  "vocab": [ {"wtag":"James Bond", "rscore": 2.14 }, 
                          {"wtag":"world", "rscore": 0.86 }, 
                          ...., 
                          {"wtag":"somemore", "rscore": 3.15 }
                        ] 
            }

document 2: {  "vocab": [ {"wtag":"hiii", "rscore": 1.34 }, 
                          {"wtag":"world", "rscore": 0.94 },
                          ...., 
                          {"wtag":"somemore", "rscore": 3.23 } 
                        ] 
            }

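I want the rscore of the matching wtag in each document to influence the _score that ES assigns, perhaps by multiplying it with or adding it to _score, so that it affects the final _score (and therefore the ordering) of the result documents. Is there a way to achieve this?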

DrT*_*ech 17

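Another way to approach this is to use nested documents:

First, set up the mapping so that vocab is a nested field, which means that each wtag/rscore pair is indexed internally as a separate document: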

curl -XPUT "http://localhost:9200/myindex/" -d'
{
  "settings": {"number_of_shards": 1}, 
  "mappings": {
    "mytype": {
      "properties": {
        "vocab": {
          "type": "nested",
          "properties": {
            "wtag": {
              "type": "string"
            },
            "rscore": {
              "type": "float"
            }
          }
        }
      }
    }
  }
}'
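To illustrate why the nested type matters here, the following is a minimal Python sketch, not Elasticsearch code. It models how a plain object field pools the values of an array per field name (so the link between a wtag and its rscore is lost), while a nested field keeps each pair together. The helper names flatten_as_object and index_as_nested are hypothetical and exist only for this illustration.

```python
# Two vocab entries taken from document 1 in the question.
vocab = [
    {"wtag": "James Bond", "rscore": 2.14},
    {"wtag": "world", "rscore": 0.86},
]


def flatten_as_object(entries):
    """Model of a plain object field: values are pooled per field name."""
    flat = {}
    for entry in entries:
        for key, value in entry.items():
            flat.setdefault("vocab." + key, []).append(value)
    return flat


def index_as_nested(entries):
    """Model of a nested field: each entry stays a separate hidden document."""
    return [{"vocab." + key: value for key, value in entry.items()}
            for entry in entries]


flat = flatten_as_object(vocab)
nested = index_as_nested(vocab)

print(flat)    # pooled lists; which rscore belongs to which wtag is unclear
print(nested)  # one small document per wtag/rscore pair
```

In the pooled form a script cannot tell that 0.86 belongs to "world", which is why the per-pair form is needed to score by the rscore of the wtag that actually matched.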

Then index your documents:

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
  "vocab": [
    {
      "wtag": "James Bond",
      "rscore": 2.14
    },
    {
      "wtag": "world",
      "rscore": 0.86
    },
    {
      "wtag": "somemore",
      "rscore": 3.15
    }
  ]
}'

curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
  "vocab": [
    {
      "wtag": "hiii",
      "rscore": 1.34
    },
    {
      "wtag": "world",
      "rscore": 0.94
    },
    {
      "wtag": "somemore",
      "rscore": 3.23
    }
  ]
}'

Then run a nested query that matches the nested documents and sums the rscore values of the matching nested documents:

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
  "query": {
    "nested": {
      "path": "vocab",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "vocab.wtag": "james bond world"
            }
          },
          "script_score": {
            "script": "doc[\"vocab.rscore\"].value"
          }
        }
      }
    }
  }
}'
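As a rough worked example of what score_mode "sum" does with the two documents above, here is a Python sketch that models the arithmetic only. It makes several simplifying assumptions: the match query is treated as an OR over lowercased, whitespace-separated words, and each matching nested document contributes exactly its rscore (as it would with "boost_mode": "replace"). With the default boost_mode, "multiply", each rscore would additionally be scaled by the text relevance score of the match. The names tokens and nested_sum_score are hypothetical.

```python
# The two example documents, as (wtag, rscore) pairs.
docs = {
    "1": [("James Bond", 2.14), ("world", 0.86), ("somemore", 3.15)],
    "2": [("hiii", 1.34), ("world", 0.94), ("somemore", 3.23)],
}


def tokens(text):
    """Very rough stand-in for analysis: lowercase and split on whitespace."""
    return set(text.lower().split())


def nested_sum_score(vocab, query_text):
    """Sum the rscore of every nested entry sharing a word with the query."""
    query_terms = tokens(query_text)
    return sum(rscore for wtag, rscore in vocab if tokens(wtag) & query_terms)


scores = {doc_id: nested_sum_score(vocab, "james bond world")
          for doc_id, vocab in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 2))
```

Under these assumptions document 1 collects the rscore of both "James Bond" and "world", while document 2 only collects the rscore of "world", so document 1 ranks first.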


DrT*_*ech 8

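Take a look at the delimited payload token filter, which you can use to store the scores as payloads, and at text scoring in scripts, which gives you access to the payloads.

UPDATED TO INCLUDE AN EXAMPLE

First, you need to set up an analyzer that takes the number after | and stores that value as a payload with each token: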

curl -XPUT "http://localhost:9200/myindex/" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "payloads": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "delimited_payload_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "payloads",
          "term_vector": "with_positions_offsets_payloads"
        }
      }
    }
  }
}'

Then index your document:

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
  "text": "James|2.14 Bond|2.14 world|0.86 somemore|3.15"
}'
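To make the effect of the payloads analyzer on this text concrete, here is a small Python sketch that imitates the three steps (whitespace tokenizer, lowercase, delimited payload filter). It is only a model of the behavior described above, assuming the | delimiter and float payloads; the function name analyze_with_payloads is hypothetical and not part of any Elasticsearch API.

```python
def analyze_with_payloads(text, delimiter="|"):
    """Imitate: whitespace tokenizer -> lowercase -> delimited payload filter."""
    result = []
    for raw in text.split():                      # whitespace tokenizer
        term, sep, payload = raw.lower().partition(delimiter)
        # The part after the delimiter becomes the payload, not part of the term.
        result.append((term, float(payload) if sep else None))
    return result


analyzed = analyze_with_payloads("James|2.14 Bond|2.14 world|0.86 somemore|3.15")

for term, payload in analyzed:
    print(term, payload)
```

Note that because the tokenizer splits on whitespace, "James Bond" becomes two tokens, which is why the same score is attached to both words in the indexed text.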

Finally, search with a function_score query that iterates over each term, retrieves the payload, and incorporates it into the _score:

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "text": "james bond"
        }
      },
      "script_score": {
        "script": "score=0; for (term: my_terms) { termInfo = _index[\"text\"].get(term,_PAYLOADS ); for (pos : termInfo) { score = score +  pos.payloadAsFloat(0);} } return score;",
        "params": {
          "my_terms": [
            "james",
            "bond"
          ]
        }
      }
    }
  }
}'

When it is not compressed into a single line, the script itself looks like this:

score=0; 
for (term: my_terms) { 
    termInfo = _index['text'].get(term,_PAYLOADS ); 
    for (pos : termInfo) { 
        score = score +  pos.payloadAsFloat(0);
    } 
} 
return score;
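The logic of the script can be checked by hand with a short Python sketch. This is a model of the loop only, assuming each term's payloads are available as a list of floats per position; the dictionary layout and the name payload_score are hypothetical. It also ignores how function_score then combines the script result with the query's own _score (by default the two are multiplied).

```python
def payload_score(term_payloads, my_terms):
    """Sum the payload of every position of every requested term."""
    score = 0.0
    for term in my_terms:
        # A term that is absent from the document contributes nothing.
        for payload in term_payloads.get(term, []):
            score += payload
    return score


# Payloads per term for the indexed example document (one position each).
doc1 = {"james": [2.14], "bond": [2.14], "world": [0.86], "somemore": [3.15]}

total = payload_score(doc1, ["james", "bond"])
print(round(total, 2))
```

For the terms "james" and "bond" the script value is the sum of their two payloads; a term that occurs several times in the text would contribute once per position.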

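WARNING: Accessing payloads has a significant performance cost, and running scripts also carries a performance cost. You may want to experiment with it using dynamic scripts as above, and then rewrite the script as a native Java script once you are satisfied with the results.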