ale*_*esc 21 autocomplete duplicates elasticsearch elasticsearch-5
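Elasticsearch 5.x introduced some (breaking) changes to the Suggester API (documentation). The most notable change is the following:
Completion suggester is document-oriented
Suggestions are aware of the document they belong to. Now, associated documents (_source) are returned as part of completion suggestions.
In short, all completion queries return all matching documents instead of just the matching words. And therein lies the problem: if the autocompleted word occurs in more than one document, it is duplicated.
Let's say we have this simple mapping: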
{
  "my-index": {
    "mappings": {
      "users": {
        "properties": {
          "firstName": {
            "type": "text"
          },
          "lastName": {
            "type": "text"
          },
          "suggest": {
            "type": "completion",
            "analyzer": "simple"
          }
        }
      }
    }
  }
}
With a few test documents:
{
  "_index": "my-index",
  "_type": "users",
  "_id": "1",
  "_source": {
    "firstName": "John",
    "lastName": "Doe",
    "suggest": [
      {
        "input": [
          "John",
          "Doe"
        ]
      }
    ]
  }
},
{
  "_index": "my-index",
  "_type": "users",
  "_id": "2",
  "_source": {
    "firstName": "John",
    "lastName": "Smith",
    "suggest": [
      {
        "input": [
          "John",
          "Smith"
        ]
      }
    ]
  }
}
And a by-the-book query:
POST /my-index/_suggest?pretty
{
  "my-suggest": {
    "text": "joh",
    "completion": {
      "field": "suggest"
    }
  }
}
Which yields the following results:
{
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "my-suggest": [
    {
      "text": "joh",
      "offset": 0,
      "length": 3,
      "options": [
        {
          "text": "John",
          "_index": "my-index",
          "_type": "users",
          "_id": "1",
          "_score": 1,
          "_source": {
            "firstName": "John",
            "lastName": "Doe",
            "suggest": [
              {
                "input": [
                  "John",
                  "Doe"
                ]
              }
            ]
          }
        },
        {
          "text": "John",
          "_index": "my-index",
          "_type": "users",
          "_id": "2",
          "_score": 1,
          "_source": {
            "firstName": "John",
            "lastName": "Smith",
            "suggest": [
              {
                "input": [
                  "John",
                  "Smith"
                ]
              }
            ]
          }
        }
      ]
    }
  ]
}
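In short, for a completion suggestion on the text "joh", two (2) documents were returned, both of them Johns, and both with the same value for the text property.
However, I would like to receive one (1) word. Something as simple as this: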
{
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "my-suggest": [
    {
      "text": "joh",
      "offset": 0,
      "length": 3,
      "options": [
        "John"
      ]
    }
  ]
}
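Question: how do I implement a word-based completion suggester? There is no need to return any document-related data, since I don't need it at this point.
Is the "Completion Suggester" even appropriate for my scenario? Or should I use a completely different approach?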
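For reference, the duplicates can also be collapsed on the client after the response arrives. The following is only a minimal Python sketch of that workaround, not a server-side fix: the helper name unique_suggestions is hypothetical, and the response dict is a trimmed copy of the output shown above with the _source bodies left out. Keep in mind that the suggester's size limit applies before this step, so fewer distinct words than requested may remain.

```python
def unique_suggestions(response, suggester_name):
    """Return option texts in first-seen order, dropping repeated texts."""
    seen = set()
    words = []
    for entry in response.get(suggester_name, []):
        for option in entry.get("options", []):
            text = option["text"]
            if text not in seen:
                seen.add(text)
                words.append(text)
    return words


# Trimmed copy of the suggest response from the question (assumption:
# only the fields needed for de-duplication are kept).
response = {
    "my-suggest": [
        {
            "text": "joh",
            "offset": 0,
            "length": 3,
            "options": [
                {"text": "John", "_id": "1", "_score": 1},
                {"text": "John", "_id": "2", "_score": 1},
            ],
        }
    ]
}

print(unique_suggestions(response, "my-suggest"))  # ['John']
```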
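EDIT: As many of you pointed out, an additional completion-only index would be a viable solution. However, I can see multiple issues with this approach:
For example, say the additional index contains the words "John", "Doe", "David", "Smith". When querying for "John D", the result for the incomplete word should be "Doe" and not "Doe", "David". To overcome this second point, indexing only single words isn't enough, since you would also need to map all words to documents in order to properly narrow down the auto-completion of subsequent words. With that, you actually have the same problem as when querying the original index. Therefore, the additional index no longer makes sense.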
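The narrowing problem can be illustrated with a small in-memory Python sketch. It assumes, purely for illustration, that the four words come from two documents ("John Doe" and "David Smith"); both function names are hypothetical and nothing here talks to Elasticsearch.

```python
# Hypothetical data: which words belong to which document is an assumption.
documents = [
    ["John", "Doe"],
    ["David", "Smith"],
]


def complete_globally(prefix, docs):
    """Complete the last word against every indexed word, ignoring context."""
    last = prefix.split()[-1].lower()
    words = {w for doc in docs for w in doc}
    return sorted(w for w in words if w.lower().startswith(last))


def complete_in_context(prefix, docs):
    """Complete the last word only within documents that contain the earlier words."""
    *earlier, last = prefix.lower().split()
    result = set()
    for doc in docs:
        lowered = [w.lower() for w in doc]
        if all(e in lowered for e in earlier):
            result.update(
                w for w in doc
                if w.lower().startswith(last) and w.lower() not in earlier
            )
    return sorted(result)


print(complete_globally("John D", documents))    # ['David', 'Doe']
print(complete_in_context("John D", documents))  # ['Doe']
```

The second function only works because it keeps the word-to-document mapping, which is exactly the information a plain word index throws away.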
Val*_*Val 19
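As hinted at in the comments, another way of achieving this without getting duplicate documents is to create a sub-field, on the field you want to complete on (autocomplete in the mapping below), that contains the edge n-grams of that field. First, you define your mapping like this: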
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "completion_analyzer": {
          "type": "custom",
          "filter": [
            "lowercase",
            "completion_filter"
          ],
          "tokenizer": "keyword"
        }
      },
      "filter": {
        "completion_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 24
        }
      }
    }
  },
  "mappings": {
    "users": {
      "properties": {
        "autocomplete": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "completion": {
              "type": "text",
              "analyzer": "completion_analyzer",
              "search_analyzer": "standard"
            }
          }
        },
        "firstName": {
          "type": "text"
        },
        "lastName": {
          "type": "text"
        }
      }
    }
  }
}
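To get a feel for what ends up in the autocomplete.completion sub-field, here is a rough Python approximation of the analyzer chain above (keyword tokenizer, then lowercase, then edge_ngram with min_gram 1 and max_gram 24). It is a sketch for illustration only and not how Elasticsearch implements the filter; the function name is made up.

```python
def completion_analyzer(value, min_gram=1, max_gram=24):
    """Approximate keyword tokenizer + lowercase + edge_ngram for one value."""
    # The keyword tokenizer keeps the whole value as a single token.
    token = value.lower()
    longest = min(max_gram, len(token))
    # Emit every prefix of the token from min_gram up to the length cap.
    return [token[:n] for n in range(min_gram, longest + 1)]


print(completion_analyzer("John Doe"))
# ['j', 'jo', 'joh', 'john', 'john ', 'john d', 'john do', 'john doe']
```

Because the whole value is one token, the prefixes span the space between the words, which is what lets an input such as "john d" match.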
Then you index a few documents:
POST my-index/users/_bulk
{"index":{}}
{ "firstName": "John", "lastName": "Doe", "autocomplete": "John Doe"}
{"index":{}}
{ "firstName": "John", "lastName": "Deere", "autocomplete": "John Deere" }
{"index":{}}
{ "firstName": "Johnny", "lastName": "Cash", "autocomplete": "Johnny Cash" }
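Then you can query for john d and get one result for John Doe and another for John Deere. Since a term query is not analyzed, the input has to be lowercased before it is sent: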
{
  "size": 0,
  "query": {
    "term": {
      "autocomplete.completion": "john d"
    }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "autocomplete.raw"
      }
    }
  }
}
Results:
{
  "aggregations": {
    "suggestions": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "John Doe",
          "doc_count": 1
        },
        {
          "key": "John Deere",
          "doc_count": 1
        }
      ]
    }
  }
}
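The reason this returns each suggestion only once can be mimicked in plain Python: the term query selects documents whose edge n-grams contain the input exactly, and the terms aggregation then groups the matches by their raw value. The sketch below is an in-memory emulation under those assumptions, with hypothetical function names; the extra "John Doe" entry is added only to show duplicates collapsing into a single bucket.

```python
from collections import Counter


def edge_ngrams(value, min_gram=1, max_gram=24):
    """Lowercased prefixes of the whole value, as the index analyzer would store them."""
    token = value.lower()
    return {token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)}


def suggest(term, values):
    """Count raw values whose edge n-grams contain `term` exactly (no analysis of `term`)."""
    return dict(Counter(v for v in values if term in edge_ngrams(v)))


# The three values from the bulk request, plus one hypothetical duplicate.
names = ["John Doe", "John Deere", "Johnny Cash", "John Doe"]

print(suggest("john d", names))  # {'John Doe': 2, 'John Deere': 1}
print(suggest("joh", names))     # {'John Doe': 2, 'John Deere': 1, 'Johnny Cash': 1}
print(suggest("John D", names))  # {} because the input is not lowercased
```

The bucket keys are the suggestions; the counts only tell you how many documents share each one.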