Using shingles and stop words with Elasticsearch and Lucene 4.4

ev0*_*n37 6 lucene stop-words elasticsearch

In the index I'm building, I want to run a query and then (using facets) return the shingles for that query. This is the analyzer I'm using on the text:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "standard",
          "filter": [
            "standard",
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer"
          ]
        }
      },
      "filter": {
        "custom_stemmer" : {
            "type": "stemmer",
            "name": "english"
        },
        "custom_stop": {
            "type": "stop",
            "stopwords": "_english_"
        },
        "custom_shingle": {
            "type": "shingle",
            "min_shingle_size": "2",
            "max_shingle_size": "3"
        }
      }
    }
  }
}

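The main problem is that, with Lucene 4.4, the stop filter no longer supports the enable_position_increments parameter, which could be used to eliminate shingles containing stop words. Instead, for a phrase like "red and yellow" I get results such as: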

"terms": [
    {
        "term": "red",
        "count": 43
    },
    {
        "term": "red _",
        "count": 43
    },
    {
        "term": "red _ yellow",
        "count": 43
    },
    {
        "term": "_ yellow",
        "count": 42
    },
    {
        "term": "yellow",
        "count": 42
    }
]

Of course, this greatly skews the number of shingles returned. Is there a way, post Lucene 4.4, to manage this without post-processing the results?
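To make the behaviour above concrete, here is a rough Python model of the stop and shingle filters. It is only a sketch, not Lucene's actual implementation: the function names `stop_filter` and `shingle_filter` and the one-word stop list are made up for illustration, and it assumes the stop filter leaves a position gap that the shingle filter fills with "_".

```python
STOPWORDS = {"and"}  # tiny stand-in for the _english_ stop word list


def stop_filter(tokens):
    """Drop stop words but keep each surviving token's original position."""
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOPWORDS]


def shingle_filter(positioned, min_size=2, max_size=3, filler="_"):
    """Emit unigrams plus shingles, filling position gaps with the filler token."""
    if not positioned:
        return []
    by_pos = dict(positioned)
    first, last = positioned[0][0], positioned[-1][0]
    # Rebuild the position sequence; removed stop words become filler slots.
    slots = [by_pos.get(p, filler) for p in range(first, last + 1)]
    out = []
    for start, tok in enumerate(slots):
        if tok != filler:
            out.append(tok)  # the unigram itself
        for size in range(min_size, max_size + 1):
            window = slots[start:start + size]
            if len(window) == size:
                out.append(" ".join(window))
    return out


tokens = "red and yellow".split()
print(shingle_filter(stop_filter(tokens)))
```

Under these assumptions the model yields the same five terms as the facet output above, which shows where the "_" entries come from.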

Cur*_*ous 7

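Probably not the most elegant solution, but the most direct one is to add another filter to your analyzer that kills the "_" filler tokens. In the example below I call it "kill_fillers":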

   "shingleAnalyzer": {
      "tokenizer": "standard",
      "filter": [
        "standard",
        "lowercase",
        "custom_stop",
        "custom_shingle",
        "custom_stemmer",
        "kill_fillers"
       ],
       ...

Then define the "kill_fillers" filter alongside your other filter definitions:

"filter": {
...
  "kill_fillers": {
    "type": "pattern_replace",
    "pattern": ".*_.*",
    "replacement": ""
  },
...
}
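As a quick sanity check of the pattern, the sketch below applies the same regular expression to the facet terms from the question using Python's `re` module. The helper name `kill_fillers` is just borrowed from the filter name. As far as I know, pattern_replace rewrites a token's text rather than removing the token, so the last step (dropping empty strings) is an assumption standing in for whatever removes the emptied tokens, for example a length filter, which is not tested here.

```python
import re

# Facet terms copied from the question's output for "red and yellow".
terms = ["red", "red _", "red _ yellow", "_ yellow", "yellow"]

# Same pattern as the kill_fillers filter: matches any token containing "_".
FILLER = re.compile(r".*_.*")


def kill_fillers(tokens):
    """Blank out filler-bearing tokens, then drop the resulting empty strings."""
    replaced = [FILLER.sub("", tok) for tok in tokens]
    # Assumption: something downstream removes the empty tokens.
    return [tok for tok in replaced if tok]


print(kill_fillers(terms))
```

Only the terms without a filler survive. Note that with this analyzer every shingle in "red and yellow" spans the removed stop word, so only the unigrams are left.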