Elasticsearch - Analyzer creates the correct tokens but the query does not match

use*_*210 6 elasticsearch

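I am trying to make Elasticsearch ignore hyphens. I don't want it to split the text on either side of a hyphen into separate words. It seems simple, but I'm banging my head against the wall.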

I want the string "Roland JD-Xi" to produce the following terms: [roland jd-xi, roland, jd-xi, jdxi, roland jdxi]

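I haven't been able to achieve this easily. Most people will just type 'jdxi', so my initial thought was simply to remove the hyphen. So I am using the following definition: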

    name: {
        "type": "string",
        "analyzer": "language",
        "include_in_all": true,
        "boost": 5,
        "fields": {
            "my_standard": {
                "type": "string",
                "analyzer": "my_standard"
            },
            "my_prefix": {
                "type": "string",
                "analyzer": "my_text_prefix",
                "search_analyzer": "my_standard"
            },
            "my_suffix": {
                "type": "string",
                "analyzer": "my_text_suffix",
                "search_analyzer": "my_standard"
            }
        }
    }

The relevant analyzers and filters are defined as:

{
"number_of_replicas": 0,
"number_of_shards": 1,
"analysis": {
    "analyzer": {
        "std": {
            "tokenizer": "standard",
            "char_filter": "html_strip",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "length", "strip_hyphens"]
        ...
        "my_text_prefix": {
            "tokenizer": "whitespace",
            "char_filter": "my_filter",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_front"]
        },
        "my_text_suffix": {
            "tokenizer": "whitespace",
            "char_filter": "my_filter",
            "filter": ["standard", "elision", "asciifolding", "lowercase", "stop", "edge_ngram_back"]
        },
        "my_standard": {
            "type": "custom",
            "tokenizer": "whitespace",
            "char_filter": "my_filter",
            "filter": ["standard", "elision", "asciifolding", "lowercase"]
        }
    },
    "char_filter": {
        "my_filter": {
            "type": "mapping",
            "mappings": ["- => ", ". => "]
        }
    },
    "filter": {
        "edge_ngram_front": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20,
            "side": "front"
        },
        "edge_ngram_back": {
            "type": "edgeNGram",
            "min_gram": 1,
            "max_gram": 20,
            "side": "back"
        },
        "strip_spaces": {
            "type": "pattern_replace",
            "pattern": "\\s",
            "replacement": ""
        },
        "strip_dots": {
            "type": "pattern_replace",
            "pattern": "\\.",
            "replacement": ""
        },
        "strip_hyphens": {
            "type": "pattern_replace",
            "pattern": "-",
            "replacement": ""
        },
        "stop": {
            "type": "stop",
            "stopwords": "_none_"
        },
        "length": {
            "type": "length",
            "min": 1
        }
    }
}

I have been able to test this (i.e. with _analyze), and the string "Roland JD-Xi" is tokenized as [roland, jdxi].

It's not exactly what I want, but it is close enough that it should match 'jdxi'.
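To make the tokenization above concrete, here is a minimal Python sketch that imitates the `my_standard` chain outside Elasticsearch. It is only an approximation under stated assumptions: the `my_filter` mapping is modelled as plain string replacement, the `whitespace` tokenizer as `str.split()`, and `asciifolding` as accent stripping; the `standard` and `elision` filters are left out because they do not affect this input. The function names are hypothetical.

```python
import unicodedata
from typing import List

# Mirrors the "my_filter" mapping char filter: delete hyphens and dots.
CHAR_MAPPINGS = {"-": "", ".": ""}


def apply_char_filter(text: str) -> str:
    """Apply each mapping to the raw text before tokenization."""
    for source, target in CHAR_MAPPINGS.items():
        text = text.replace(source, target)
    return text


def ascii_fold(token: str) -> str:
    """Rough stand-in for asciifolding: drop combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def my_standard(text: str) -> List[str]:
    """Char filter, then whitespace tokenization, then fold and lowercase."""
    filtered = apply_char_filter(text)
    tokens = filtered.split()  # approximates the whitespace tokenizer
    return [ascii_fold(token).lower() for token in tokens]


print(my_standard("Roland JD-Xi"))  # ['roland', 'jdxi']
```

Because the hyphen is removed before tokenization, "JD-Xi" never becomes two tokens in this chain, which is consistent with the _analyze result reported above.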

But this is my problem. If I run a simple "index/_search?q=jdxi", it does not return the document. However, if I run "index/_search?q=roland+jdxi", it does return the document.

So at least I know the hyphen is being removed. But if the tokens "roland" and "jdxi" are being created, why doesn't "index/_search?q=jdxi" match the document?

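  1. Is there something wrong with my indexing process or my query process?
  2. How can I fix it?
  3. Can anyone explain how to achieve the desired tokens [roland jd-xi, roland, jd-xi, jdxi, roland jdxi]?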

Val*_*Val 3

I reproduced your case on ES 6, and searching for index/_search?q=jdxi does return the document.

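The problem is probably that when you search index/_search?q=jdxi without specifying a field, the query basically runs against _all, which contains whatever is in the name field (basically the same as index/_search?q=name:jdxi). Since that field is not analyzed with your my_standard analyzer, you won't get any results.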
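The mismatch can be illustrated with a small Python simulation. This is a sketch, not Elasticsearch itself: the `language` analyzer used by `name` is not shown in the question, so it is assumed here to split on hyphens the way the standard tokenizer does, and `hyphen_splitting_analyzer`, `FIELD_ANALYZERS` and `matches` are hypothetical names. A query is treated as matching when any of its terms is in the index, which corresponds to the default OR behaviour of `q=`.

```python
import re


def hyphen_splitting_analyzer(text):
    # ASSUMPTION: stands in for the unshown "language" analyzer on `name`;
    # it breaks on anything that is not a letter or digit, hyphens included.
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def my_standard(text):
    # Simplified my_standard: strip "-" and ".", lowercase, split on whitespace.
    return text.replace("-", "").replace(".", "").lower().split()


# Each field is indexed and searched with its own analyzer.
FIELD_ANALYZERS = {
    "name": hyphen_splitting_analyzer,
    "name.my_standard": my_standard,
}


def matches(field, document, query):
    """Return True if any analyzed query term is among the indexed terms."""
    analyzer = FIELD_ANALYZERS[field]
    indexed_terms = set(analyzer(document))
    return any(term in indexed_terms for term in analyzer(query))


doc = "Roland JD-Xi"
print(matches("name", doc, "jdxi"))              # False
print(matches("name", doc, "roland jdxi"))       # True
print(matches("name.my_standard", doc, "jdxi"))  # True
```

Under that assumption, `name` holds the terms roland, jd and xi, so "jdxi" alone finds nothing while "roland jdxi" matches on "roland". That lines up with the behaviour described in the question.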

What you should do instead is search on the my_standard sub-field, i.e. index/_search?q=name.my_standard:jdxi, and I'm pretty sure you will get the document.
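For reference, a short Python sketch of how such a field-scoped search could be assembled. It only builds the URL and an equivalent request body and sends nothing over the network; the helper names are hypothetical, and the index name `index` is the placeholder used in the question.

```python
import json
from urllib.parse import urlencode


def build_search_url(index, field, term):
    """Build a URI search that targets one field instead of the default."""
    # safe=":" keeps the field:term separator readable in the query string.
    query_string = urlencode({"q": f"{field}:{term}"}, safe=":")
    return f"{index}/_search?{query_string}"


def build_match_body(field, term):
    """Equivalent request body using a match query on the same field."""
    return {"query": {"match": {field: term}}}


print(build_search_url("index", "name.my_standard", "jdxi"))
print(json.dumps(build_match_body("name.my_standard", "jdxi")))
```

The match query analyzes "jdxi" with the search analyzer of `name.my_standard`, so the query term goes through the same hyphen-stripping chain as the indexed text.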