tis*_*hma 6 camelcasing elasticsearch
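I'm struggling to get iPhone to match when searching for iphone in Elasticsearch.
Since I have some source code at stake, I certainly need a CamelCase tokenizer, but it appears to break iPhone into two terms, so a search for iphone finds nothing.
Does anyone know of a way to add exceptions to the splitting of camelCase words into tokens (camel + case)?
UPDATE: To make it clear, I want NullPointerException to be tokenized as [null, pointer, exception], but I don't want iPhone to become [i, phone].
Is there any other solution?
UPDATE 2: @ChintanShah's answer suggested a different approach that gives us even more: NullPointerException will be tokenized as [null, pointer, exception, nullpointer, pointerexception, nullpointerexception], which is definitely much more useful from the point of view of the person searching. Indexing is also faster! The price to pay is index size, but it is a superior solution.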
You can achieve your requirements with the word_delimiter token filter. This is my setup:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "camel_filter",
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "filter": {
        "camel_filter": {
          "type": "word_delimiter",
          "generate_number_parts": false,
          "stem_english_possessive": false,
          "split_on_numerics": false,
          "protected_words": [
            "iPhone",
            "WiFi"
          ]
        }
      }
    }
  },
  "mappings": {
  }
}
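This splits words on case changes, so NullPointerException will be tokenized as null, pointer and exception, but iPhone and WiFi will remain as they are because they are protected. word_delimiter has a lot of options for flexibility. You can also use preserve_original, which may help you.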
GET logs_index/_analyze?text=iPhone&analyzer=camel_analyzer
Result:
{
  "tokens": [
    {
      "token": "iphone",
      "start_offset": 0,
      "end_offset": 6,
      "type": "word",
      "position": 1
    }
  ]
}
Now with:
GET logs_index/_analyze?text=NullPointerException&analyzer=camel_analyzer
Result:
{
  "tokens": [
    {
      "token": "null",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "pointer",
      "start_offset": 4,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "exception",
      "start_offset": 11,
      "end_offset": 20,
      "type": "word",
      "position": 3
    }
  ]
}
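To make the two results above easier to reason about, here is a rough pure-Python imitation of what camel_analyzer does to a piece of text. It is only a sketch, not how Elasticsearch implements word_delimiter: it handles nothing but lower-to-upper case changes, and the function name camel_tokens and the PROTECTED set are made up for this illustration.

```python
import re

# Hypothetical stand-in for the "protected_words" list in the settings.
PROTECTED = {"iPhone", "WiFi"}


def camel_tokens(text, protected=PROTECTED):
    """Split whitespace-separated words on lower-to-upper case changes,
    leave protected words whole, then lowercase every token."""
    tokens = []
    for word in text.split():  # mimics the whitespace tokenizer
        if word in protected:
            parts = [word]  # protected words are never split
        else:
            # Put a space wherever a lowercase letter is followed by an
            # uppercase one, then split on that space.
            parts = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", word).split()
        tokens.extend(part.lower() for part in parts)  # mimics lowercase
    return tokens


print(camel_tokens("iPhone"))
print(camel_tokens("NullPointerException"))
```

Passing protected=set() shows the behaviour the question complains about, where iPhone is broken into i and phone.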
Another approach is to analyze your field twice with different analyzers, but I feel word_delimiter will do the job.
Does this help?
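For the alternative of analyzing the field twice, one common way is a multi-field mapping, where the same source value is indexed under a second sub-field with its own analyzer. The sketch below only builds such a mapping body as a Python dict and prints it as JSON. The field name message and the sub-field name camel are hypothetical, and the "text" type assumes Elasticsearch 5 or later (older versions use "string" and nest the mapping under a type name).

```python
import json


def multi_field_mapping(field="message"):
    """Build a mapping body that indexes one field with two analyzers.

    `field` and the sub-field name "camel" are illustrative assumptions.
    """
    return {
        "properties": {
            field: {
                "type": "text",
                "analyzer": "standard",  # main field: default analysis
                "fields": {
                    # sub-field analyzed with the camel-case analyzer
                    "camel": {"type": "text", "analyzer": "camel_analyzer"}
                },
            }
        }
    }


print(json.dumps(multi_field_mapping(), indent=2))
```

A query could then target message, message.camel, or both, depending on which tokenization should match.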