pan*_*kaj 3 tokenize analyzer elasticsearch elasticsearch-6
The requirement is to create a custom analyzer that produces two tokens for a single input, as in the following scenario.
For example:
Input -> B.tech in
Output Tokens ->
- btechin
- b.tech in
I can strip the non-alphanumeric characters, but how do I also keep the original string in the output token list? Below is the custom analyzer I created.
"alphanumericStringAnalyzer": {
  "filter": [
    "lowercase",
    "minLength_filter"
  ],
  "char_filter": [
    "specialCharactersFilter"
  ],
  "type": "custom",
  "tokenizer": "keyword"
}

"char_filter": {
  "specialCharactersFilter": {
    "pattern": "[^A-Za-z0-9]",
    "type": "pattern_replace",
    "replacement": ""
  }
},
This analyzer produces a single token, "btechin", for the input "B.tech in", but I also want the original token "B.tech in" to appear in the token list.
Thanks!
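You can use the word delimiter token filter (word_delimiter), as described in the Elasticsearch documentation.
Here is an example word delimiter configuration: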
POST _analyze
{
  "text": "B.tech in",
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    {
      "type": "word_delimiter",
      "catenate_all": true,
      "preserve_original": true,
      "generate_word_parts": false
    }
  ]
}
Result:
{
  "tokens": [
    {
      "token": "b.tech in",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    },
    {
      "token": "btechin",
      "start_offset": 0,
      "end_offset": 9,
      "type": "word",
      "position": 0
    }
  ]
}
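To use the same filter chain in an index rather than in an ad hoc _analyze call, it can be registered as a named filter in the index settings. The following is only a sketch under assumptions: the index name my_index and the filter name catenate_preserve_filter are hypothetical, and the minLength_filter from the question is left out because its definition was not shown. The pattern_replace char filter is also dropped, since it would strip the special characters before the token filters ever see the original text.

```json
PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "catenate_preserve_filter": {
          "type": "word_delimiter",
          "catenate_all": true,
          "preserve_original": true,
          "generate_word_parts": false
        }
      },
      "analyzer": {
        "alphanumericStringAnalyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "catenate_preserve_filter"
          ]
        }
      }
    }
  }
}
```

Because lowercase runs before the word delimiter filter, the preserved original token is emitted in lowercase ("b.tech in"), matching the output requested in the question. Running POST my_index/_analyze with "analyzer": "alphanumericStringAnalyzer" and the same text should be a quick way to check the result.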
Hope this meets your requirement!