How do I modify the standard analyzer to include #?

use*_*109 1 analyzer elasticsearch

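Some characters, such as #, are treated as delimiters, so they never match in a query. What custom analyzer configuration, as close to the standard analyzer as possible, would allow these characters to be matched?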

Chi*_*h25 6

1) The simplest approach is to use the whitespace tokenizer with a lowercase filter:

curl -XGET 'localhost:9200/_analyze?tokenizer=whitespace&filters=lowercase&pretty' -d 'new year #celebration vegas'

This gives you:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "#celebration",
    "start_offset" : 9,
    "end_offset" : 21,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "word",
    "position" : 4
  } ]
}
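
The _analyze call above only tries out a tokenizer and filter combination ad hoc. To apply the same combination to a field, it would need to be registered as a custom analyzer in the index settings. A minimal sketch, assuming a hypothetical index name my_whitespace_index and a hypothetical analyzer name whitespace_lowercase (neither comes from the original answer):

```
PUT my_whitespace_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "whitespace_lowercase": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  }
}
```

Keep in mind that the whitespace tokenizer splits on whitespace only, so punctuation stays attached to words: "vegas," would be indexed as the token vegas, with the comma included.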

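2) If you only want to preserve a few special characters, you can map them with a char filter, so that your text is converted into something else before tokenization happens. This stays closer to the standard analyzer. For example, you can create the index like this: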

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "special_analyzer": {
          "char_filter": [
            "special_mapping"
          ],
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "special_mapping": {
          "type": "mapping",
          "mappings": [
            "#=>hashtag\\u0020"
          ]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "tweet": {
          "type": "string",
          "analyzer": "special_analyzer"
        }
      }
    }
  }
}

Now running curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=special_analyzer&pretty' -d 'new year #celebration vegas' with the custom analyzer produces the following tokens:

{
  "tokens" : [ {
    "token" : "new",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "year",
    "start_offset" : 4,
    "end_offset" : 8,
    "type" : "<ALPHANUM>",
    "position" : 2
  }, {
    "token" : "hashtag",
    "start_offset" : 9,
    "end_offset" : 9,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "celebration",
    "start_offset" : 10,
    "end_offset" : 21,
    "type" : "<ALPHANUM>",
    "position" : 4
  }, {
    "token" : "vegas",
    "start_offset" : 22,
    "end_offset" : 27,
    "type" : "<ALPHANUM>",
    "position" : 5
  } ]
}

So you can search like this:

GET my_index/_search
{
  "query": {
    "match": {
      "tweet": "#celebration"
    }
  }
}
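
One thing to be aware of: the analyzer turns #celebration into two terms, hashtag and celebration, and a match query uses the OR operator by default, so it also returns documents that contain only one of the two terms. If only the hashtag form should match, a phrase query is one possible option, since the two tokens sit at adjacent positions. A sketch reusing the index and field from above (this variant is not part of the original answer):

```
GET my_index/_search
{
  "query": {
    "match_phrase": {
      "tweet": "#celebration"
    }
  }
}
```

With this query a tweet containing plain celebration without the # should not match, because the term hashtag must directly precede it.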

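You can also search for just celebration, because I mapped # to hashtag followed by a space (the Unicode escape \\u0020), so the tag marker and the word become separate tokens. Without the space, we would always have to include the # when searching.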
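
For comparison, here is a sketch of the char filter without the trailing space, under a hypothetical filter name special_mapping_joined (not from the original answer). With this mapping, #celebration is rewritten to hashtagcelebration and comes out of the standard tokenizer as a single token:

```
"char_filter": {
  "special_mapping_joined": {
    "type": "mapping",
    "mappings": [
      "#=>hashtag"
    ]
  }
}
```

In that setup a search for #celebration matches only hashtagged text, and a search for plain celebration no longer finds it. Which behaviour is preferable depends on whether hashtags and ordinary words should be searchable together.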

Hope this helps!!