ev0*_*n37 6 lucene stop-words elasticsearch
In an index I'm building, I'm interested in running a query and then (using facets) returning the shingles for that query. Here is the analyzer I'm using on the text:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "3"
        }
      }
    }
  }
}
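The main problem is that, as of Lucene 4.4, the stop filter no longer supports the enable_position_increments parameter, which could be used to eliminate shingles that contain stop words. Instead, for the text "red and yellow" I get results like:
"terms": [
  {
    "term": "red",
    "count": 43
  },
  {
    "term": "red _",
    "count": 43
  },
  {
    "term": "red _ yellow",
    "count": 43
  },
  {
    "term": "_ yellow",
    "count": 42
  },
  {
    "term": "yellow",
    "count": 42
  }
]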
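Naturally, this greatly skews the counts of the shingles returned. Is there a way, post-Lucene 4.4, to manage this without post-processing the results?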
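To see where the "_" entries come from, here is a minimal Python sketch of a stop filter that leaves a position gap, followed by a shingle step that writes a filler token into that gap. It is a simplified model, not Lucene's actual implementation: the function names, the tiny stop-word set, and the way gaps are handled are assumptions made for illustration only.

```python
# Simplified model of stop-word removal followed by shingling.
# STOPWORDS is a tiny illustrative subset, not Lucene's full _english_ list.
STOPWORDS = {"and", "the", "a", "of"}


def stop_filter(tokens):
    """Drop stop words but keep each surviving token's original position."""
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOPWORDS]


def shingles(positioned, min_size=2, max_size=3, filler="_"):
    """Emit unigrams plus shingles, writing `filler` into position gaps."""
    if not positioned:
        return []
    by_pos = dict(positioned)
    # Rebuild the stream by position; a missing position becomes the filler.
    stream = [by_pos.get(p, filler) for p in range(min(by_pos), max(by_pos) + 1)]
    out = []
    for start in range(len(stream)):
        if stream[start] != filler:
            out.append(stream[start])  # unigram (a bare filler is never emitted)
        for size in range(min_size, max_size + 1):
            window = stream[start:start + size]
            if len(window) == size:
                out.append(" ".join(window))
    return out


terms = shingles(stop_filter("red and yellow".split()))
print(terms)
```

For "red and yellow" this model yields the same five terms as the facet output above, which suggests the filler comes from the gap left where "and" was removed.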
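Probably not the most elegant solution, but the bluntest approach is to add another filter to your analyzer that kills the "_" filler tokens. In the example below I called it "kill_fillers":
"shingleAnalyzer": {
  "tokenizer": "standard",
  "filter": [
    "standard",
    "lowercase",
    "custom_stop",
    "custom_shingle",
    "custom_stemmer",
    "kill_fillers"
  ],
  ...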
Then add the "kill_fillers" filter to your filter definitions:
"filter": {
  ...
  "kill_fillers": {
    "type": "pattern_replace",
    "pattern": ".*_.*",
    "replacement": ""
  },
  ...
}
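As a rough check of what the ".*_.*" pattern does, the following Python sketch applies the same regular expression to the terms from the question. It only approximates the filter with Python's re module; the helper names are hypothetical, and the final step of discarding empty strings is an assumption about what you would want downstream, not something the pattern replacement does by itself.

```python
import re

# Same pattern as the "kill_fillers" filter: any token containing "_".
FILLER_PATTERN = re.compile(r".*_.*")


def kill_fillers(tokens):
    """Replace every token that contains an underscore with an empty string."""
    return [FILLER_PATTERN.sub("", tok) for tok in tokens]


def drop_empty(tokens):
    """Discard the empty strings left behind by the replacement (assumed extra step)."""
    return [tok for tok in tokens if tok]


tokens = ["red", "red _", "red _ yellow", "_ yellow", "yellow"]
replaced = kill_fillers(tokens)
kept = drop_empty(replaced)
print(replaced)
print(kept)
```

Two things are worth noting from this sketch. The replacement blanks the token rather than removing it, so empty terms may still need to be filtered out (for example with a length filter, if your setup supports one). The pattern also blanks any genuine token that happens to contain an underscore, and with it the shingles built from that token.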