raf*_*ron · 5 · tags: lucene, search, elasticsearch
My use case is searching with edge_ngram support while also applying synonyms, where the tokens to match should be in consecutive positions.
While experimenting with _analyze, I observed that the filter chain behaves in two different ways with respect to position increments:

lowercase, synonym: no position increment from the SynonymFilter
lowercase, edge_ngram, synonym: position increment from the SynonymFilter

Here are the queries I ran for each case:
Case 1. No position increment
PUT synonym_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms": [
            "begin => start"
          ]
        }
      }
    }
  }
}

GET synonym_test/_analyze
{
  "text": "begin working",
  "analyzer": "by_smart"
}
Output:
{
  "tokens": [
    {
      "token": "start",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "working",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 1
    }
  ]
}
Case 2. Position increment
PUT synonym_test
{
  "index": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "custom_edge_ngram",
            "custom_synonym"
          ]
        }
      },
      "filter": {
        "custom_synonym": {
          "type": "synonym",
          "synonyms": [
            "begin => start"
          ]
        },
        "custom_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": "2",
          "max_gram": "60"
        }
      }
    }
  }
}

GET synonym_test/_analyze
{
  "text": "begin working",
  "analyzer": "by_smart"
}
Output:
{
  "tokens": [
    {
      "token": "be",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "beg",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "begi",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "start",
      "start_offset": 0,
      "end_offset": 5,
      "type": "SYNONYM",
      "position": 1
    },
    {
      "token": "wo",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "wor",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "work",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "worki",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "workin",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    },
    {
      "token": "working",
      "start_offset": 6,
      "end_offset": 13,
      "type": "word",
      "position": 2
    }
  ]
}
Notice that in Case 1, the token begin is replaced by start at the same position, and there is no position increment. In Case 2, however, when the begin token is replaced by start, the positions of the subsequent tokens in the stream are incremented.
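To see why this extra increment matters for phrase queries, here is a minimal sketch (plain Python, not Elasticsearch code; the function and its simplified position check are illustrative assumptions, not Lucene's actual phrase-matching implementation):

```python
# Simplified model of slop-limited phrase matching over an inverted index
# that records token positions, as in the _analyze outputs above.
def phrase_matches(positions, terms, slop=0):
    """positions: dict mapping term -> set of positions in the indexed text."""
    for start in positions.get(terms[0], set()):
        ok = True
        for i, term in enumerate(terms[1:], start=1):
            # Each later term must sit within `slop` of its expected slot.
            expected = start + i
            if not any(abs(p - expected) <= slop for p in positions.get(term, set())):
                ok = False
                break
        if ok:
            return True
    return False

# Case 2 index of "begin working": begi at position 0, wor at position 2,
# because the synonym token consumed position 1.
case2 = {"begi": {0}, "wor": {2}}
print(phrase_matches(case2, ["begi", "wor"]))           # slop 0 -> False
print(phrase_matches(case2, ["begi", "wor"], slop=1))   # slop 1 -> True
```

With the Case 2 positions, a phrase query with the default slop of 0 cannot bridge the gap that the synonym's position increment introduced.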
Now my questions are:
When I search for begi wor with a match_phrase query (default slop of 0), it does not match begin work. This happens because begi and wor end up 2 positions apart. Any suggestions on how to achieve the matching I need without affecting my use case? I'm using Elasticsearch version 5.6.8, which has Lucene version 6.6.1.
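For reference, the failing search looks like the following (the field name my_field and its mapping onto the by_smart analyzer are assumptions for illustration):

```json
GET synonym_test/_search
{
  "query": {
    "match_phrase": {
      "my_field": {
        "query": "begi wor",
        "slop": 0
      }
    }
  }
}
```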
I've read several documentation links and articles, but I couldn't find any that properly explain why this happens, or whether there is some setting that produces the behavior I want.