I have a set of words extracted from text by an NLP algorithm, with a relevance score for each word in each document.
For example:
document 1: { "vocab": [ {"wtag":"James Bond", "rscore": 2.14 },
{"wtag":"world", "rscore": 0.86 },
....,
{"wtag":"somemore", "rscore": 3.15 }
]
}
document 2: { "vocab": [ {"wtag":"hiii", "rscore": 1.34 },
{"wtag":"world", "rscore": 0.94 },
....,
{"wtag":"somemore", "rscore": 3.23 }
]
}
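I want to match on wtag in each document and have the corresponding rscore influence the _score that ES assigns, possibly by multiplying it with or adding it to _score, so that it affects the final _score (and therefore the order) of the resulting documents. Is there a way to achieve this?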
DrT*_*ech 17
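Another way to approach this is to use nested documents:
First, set up the mapping so that vocab is a nested document, which means each wtag/rscore pair is indexed internally as a separate document: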
curl -XPUT "http://localhost:9200/myindex/" -d'
{
  "settings": {"number_of_shards": 1},
  "mappings": {
    "mytype": {
      "properties": {
        "vocab": {
          "type": "nested",
          "properties": {
            "wtag": {
              "type": "string"
            },
            "rscore": {
              "type": "float"
            }
          }
        }
      }
    }
  }
}'
Then index your documents:
curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
  "vocab": [
    {
      "wtag": "James Bond",
      "rscore": 2.14
    },
    {
      "wtag": "world",
      "rscore": 0.86
    },
    {
      "wtag": "somemore",
      "rscore": 3.15
    }
  ]
}'
curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
  "vocab": [
    {
      "wtag": "hiii",
      "rscore": 1.34
    },
    {
      "wtag": "world",
      "rscore": 0.94
    },
    {
      "wtag": "somemore",
      "rscore": 3.23
    }
  ]
}'
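Then run a nested query that matches the nested documents and factors the rscore of each matching nested document into its score; score_mode sum adds these per-match scores together for the parent document: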
curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
  "query": {
    "nested": {
      "path": "vocab",
      "score_mode": "sum",
      "query": {
        "function_score": {
          "query": {
            "match": {
              "vocab.wtag": "james bond world"
            }
          },
          "script_score": {
            "script": "doc[\"vocab.rscore\"].value"
          }
        }
      }
    }
  }
}'
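To see what score_mode sum does in the nested query, the following Python sketch mimics it outside Elasticsearch. It is a simplification under stated assumptions: the function name nested_sum_score is hypothetical, matching is reduced to lowercase whitespace-token overlap, and the text relevance of each match is treated as 1.0 so that only rscore contributes (a real function_score query multiplies the match score by the script value by default).

```python
# Simplified model of a nested query with score_mode "sum".
# Assumption: every matching nested entry has text relevance 1.0,
# so its contribution is just its rscore.

def nested_sum_score(vocab, query):
    """Sum the rscore of every vocab entry sharing a token with the query."""
    query_tokens = set(query.lower().split())
    total = 0.0
    for entry in vocab:
        tag_tokens = set(entry["wtag"].lower().split())
        if tag_tokens & query_tokens:  # at least one query term matches
            total += entry["rscore"]
    return total


doc1 = [
    {"wtag": "James Bond", "rscore": 2.14},
    {"wtag": "world", "rscore": 0.86},
    {"wtag": "somemore", "rscore": 3.15},
]
doc2 = [
    {"wtag": "hiii", "rscore": 1.34},
    {"wtag": "world", "rscore": 0.94},
    {"wtag": "somemore", "rscore": 3.23},
]

for name, vocab in (("doc1", doc1), ("doc2", doc2)):
    print(name, round(nested_sum_score(vocab, "james bond world"), 2))
```

Under these assumptions document 1 outranks document 2, because two of its nested entries match the query while only one entry of document 2 does.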
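Have a look at the delimited payload token filter, which you can use to store the scores as payloads, and at text scoring in scripts, which gives you access to the payloads.
Updated to include an example
First, you need to set up an analyzer that takes the number after | and stores that value as the payload of each token: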
curl -XPUT "http://localhost:9200/myindex/" -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "payloads": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "delimited_payload_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "mytype": {
      "properties": {
        "text": {
          "type": "string",
          "analyzer": "payloads",
          "term_vector": "with_positions_offsets_payloads"
        }
      }
    }
  }
}'
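As a rough illustration of what this analyzer chain produces, the Python sketch below imitates the three steps (whitespace tokenizer, lowercase filter, delimited payload filter). It is not Elasticsearch code: the function name analyze is hypothetical, and splitting at the first | is an assumption of this sketch.

```python
# Imitation of: whitespace tokenizer -> lowercase -> delimited payload filter.
# Assumption: the token is split at the first "|" and the remainder is
# parsed as a float payload.

def analyze(text, delimiter="|"):
    """Return (term, position, payload) tuples for each whitespace token."""
    tokens = []
    for position, raw in enumerate(text.split()):
        lowered = raw.lower()
        term, sep, payload = lowered.partition(delimiter)
        tokens.append((term, position, float(payload) if sep else None))
    return tokens


for token in analyze("James|2.14 Bond|2.14 world|0.86 somemore|3.15"):
    print(token)
```

Each token ends up as a lowercase term with its score attached, which is what the scoring script later reads back per position.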
Then index your document:
curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
  "text": "James|2.14 Bond|2.14 world|0.86 somemore|3.15"
}'
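The text value in this request can be generated from the vocab list in the question. A minimal Python sketch, assuming a hypothetical helper named to_payload_text and assuming that every whitespace-separated token of a multi-word wtag should carry the tag's rscore (as "James|2.14 Bond|2.14" does in the example):

```python
# Build the "term|score" text expected by the payload analyzer.
# Assumption: multi-word tags are split on whitespace and each token
# receives the full rscore of its tag.

def to_payload_text(vocab, delimiter="|"):
    """Flatten wtag/rscore entries into a delimited-payload string."""
    parts = []
    for entry in vocab:
        for token in entry["wtag"].split():
            parts.append(f"{token}{delimiter}{entry['rscore']}")
    return " ".join(parts)


vocab = [
    {"wtag": "James Bond", "rscore": 2.14},
    {"wtag": "world", "rscore": 0.86},
    {"wtag": "somemore", "rscore": 3.15},
]
print(to_payload_text(vocab))
```

Tags that themselves contain the delimiter character would need escaping or a different delimiter; the sketch does not handle that case.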
Finally, search with a function_score query that iterates over each term, retrieves its payload, and combines it with the _score:
curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "text": "james bond"
        }
      },
      "script_score": {
        "script": "score=0; for (term: my_terms) { termInfo = _index[\"text\"].get(term,_PAYLOADS ); for (pos : termInfo) { score = score + pos.payloadAsFloat(0);} } return score;",
        "params": {
          "my_terms": [
            "james",
            "bond"
          ]
        }
      }
    }
  }
}'
When it is not compressed into a single line, the script itself looks like this:
score=0;
for (term: my_terms) {
    termInfo = _index['text'].get(term,_PAYLOADS );
    for (pos : termInfo) {
        score = score + pos.payloadAsFloat(0);
    }
}
return score;
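The same loop can be traced in Python to check what value the script returns for a document. This is only a sketch: the dictionary doc1 is a hypothetical in-memory stand-in for the payloads stored per term position, and payload_score is an illustrative name, not an Elasticsearch API.

```python
# Python rendering of the script's logic: for every query term, add up
# the payload stored at each position where the term occurs.

def payload_score(term_payloads, my_terms):
    """Sum the payloads of all occurrences of the given terms."""
    score = 0.0
    for term in my_terms:
        # A term missing from the document contributes nothing.
        for payload in term_payloads.get(term, []):
            score += payload
    return score


# Payloads for "James|2.14 Bond|2.14 world|0.86 somemore|3.15" after analysis.
doc1 = {"james": [2.14], "bond": [2.14], "world": [0.86], "somemore": [3.15]}

print(payload_score(doc1, ["james", "bond"]))  # 4.28
```

Because "James Bond" was indexed as two tokens that each carry 2.14, a query for both terms counts the tag's score twice; whether that is desirable depends on how the scores should combine.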
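Warning: accessing payloads has a significant performance cost, and running scripts has a performance cost as well. You may want to experiment with a dynamic script as above, then rewrite it as a native Java script once you are satisfied with the results.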