Dots in a field are not used by the analyzer to break up words

Question

I have the following index mapping (simplified):

{
    "documents": {
        "mappings": {
            "document": {
                "properties": {
                    "filename": {
                        "type": "string",
                        "fields": {
                            "lower_case_sort": {
                                "type": "string",
                                "analyzer": "case_insensitive_sort"
                            },
                            "raw": {
                                "type": "string",
                                "index": "not_analyzed"
                            }
                        }
                    }
                }
            }
        }
    }
}

I put two documents into this index:

{
    "_index": "documents",
    "_type": "document",
    "_id": "777",
    "_source": {
        "filename": "text.txt"
    }
}

...

{
    "_index": "documents",
    "_type": "document",
    "_id": "888",
    "_source": {
        "filename": "text 123.txt"
    }
}

When I run a query_string or simple_query_string query for "text", I expect to get both documents back. They should match because the filenames are "text.txt" and "text 123.txt".

http://localhost:9200/defiant/_search?q=text

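However, I only find the document named "text 123.txt". The document "text.txt" is found only when I search for "text.*", "text.txt" or "text.???", so I have to include the dot from the filename in the query.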

Here is my explain result for document id 777 (text.txt):

curl -XGET 'http://localhost:9200/documents/document/777/_explain' -d '{"query": {"query_string" : {"query" : "text"}}}'

-->

{
    "_index": "documents",
    "_type": "document",
    "_id": "777",
    "matched": false,
    "explanation": {
        "value": 0.0,
        "description": "Failure to meet condition(s) of required/prohibited clause(s)",
        "details": [{
            "value": 0.0,
            "description": "no match on required clause (_all:text)",
            "details": [{
                "value": 0.0,
                "description": "no matching term",
                "details": []
            }]
        }, {
            "value": 0.0,
            "description": "match on required clause, product of:",
            "details": [{
                "value": 0.0,
                "description": "# clause",
                "details": []
            }, {
                "value": 0.47650534,
                "description": "_type:document, product of:",
                "details": [{
                    "value": 1.0,
                    "description": "boost",
                    "details": []
                }, {
                    "value": 0.47650534,
                    "description": "queryNorm",
                    "details": []
                }]
            }]
        }]
    }
}

Did I mess up the mapping? I would have thought that "." is treated as a term separator when the document is analyzed at index time...

Edit: settings for case_insensitive_sort

{
    "documents": {
        "settings": {
            "index": {
                "creation_date": "1473169458336",
                "analysis": {
                    "analyzer": {
                        "case_insensitive_sort": {
                            "filter": [
                                "lowercase"
                            ],
                            "tokenizer": "keyword"
                        }
                    }
                }
            }
        }
    }
}

Answer 1

Chi*_*h25 6

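This is the expected behavior of the standard analyzer (the default analyzer): it uses the standard tokenizer, and under the algorithm that tokenizer uses, the dot is not treated as a separator character.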

You can verify this with the analyze API:

curl -XGET 'localhost:9200/_analyze' -d '
{
  "analyzer" : "standard",
  "text" : "test.txt"
}'

This produces only a single token:

{
  "tokens": [
    {
      "token": "test.txt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
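
To connect this back to the question, here is a rough Python approximation of that tokenizing behavior. It is not Elasticsearch's tokenizer: the function name `approx_standard_tokens` and its simplified rule (a dot stays inside a token only when it sits between two letters or between two digits) are assumptions made purely for illustration.

```python
def approx_standard_tokens(text):
    """Very rough stand-in for the standard analyzer (illustration only)."""
    tokens = []
    current = ""
    for i, ch in enumerate(text):
        if ch.isalnum():
            current += ch
            continue
        keep_dot = False
        if ch == "." and current and i + 1 < len(text):
            prev, nxt = text[i - 1], text[i + 1]
            # Assumed rule: keep a dot between two letters or between two digits.
            keep_dot = (prev.isalpha() and nxt.isalpha()) or (
                prev.isdigit() and nxt.isdigit()
            )
        if keep_dot:
            current += ch
        elif current:
            # Any other non-alphanumeric character ends the current token.
            tokens.append(current.lower())
            current = ""
    if current:
        tokens.append(current.lower())
    return tokens


for filename in ("text.txt", "text 123.txt"):
    tokens = approx_standard_tokens(filename)
    print(filename, "->", tokens, "| contains 'text':", "text" in tokens)
```

Under this approximation, only "text 123.txt" yields a standalone `text` token, which would be consistent with the search for `text` returning document 888 but not 777.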

You can use a pattern replace character filter to replace dots with whitespace.

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_dot"
          ]
        }
      },
      "char_filter": {
        "replace_dot": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  }
}
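
As a minimal sketch of what this char filter does to the two filenames from the question, the following Python applies the same regular expression and then splits on whitespace. The helper names are hypothetical, and `str.split()` is only a stand-in for the standard tokenizer, which is a reasonable simplification here because after the replacement these inputs contain only letters, digits, and spaces.

```python
import re


def apply_replace_dot(text):
    # Same pattern as the char filter: "\\." in JSON is the regex \. (a literal dot).
    return re.sub(r"\.", " ", text)


def tokens_after_replace_dot(text):
    # Assumption: whitespace splitting stands in for the standard tokenizer.
    return apply_replace_dot(text).split()


for filename in ("text.txt", "text 123.txt"):
    print(filename, "->", tokens_after_replace_dot(filename))
```

Both filenames now produce a `text` token. Note that `my_analyzer` as defined has no `lowercase` token filter, so tokens keep their original case; if case-insensitive matching is wanted, adding the `lowercase` filter already used by the question's `case_insensitive_sort` analyzer may be worth considering.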

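You will have to reindex your documents, after which you should get the desired results. The analyze API is very handy for checking how documents are stored in the inverted index.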
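
The settings above define `my_analyzer` but do not show how it gets attached to a field. One possible way, sketched here in Python, is to set the `analyzer` parameter on the `filename` field, using the same string-type mapping syntax as the question. The function name `build_index_body` is hypothetical, and assigning the analyzer directly to `filename` is an assumption; the resulting body would be used when creating a new index before reindexing.

```python
import json


def build_index_body():
    """Combine the answer's analysis settings with the question's mapping."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {
                        "tokenizer": "standard",
                        "char_filter": ["replace_dot"],
                    },
                    # Copied from the question so the sub-field below still resolves.
                    "case_insensitive_sort": {
                        "tokenizer": "keyword",
                        "filter": ["lowercase"],
                    },
                },
                "char_filter": {
                    "replace_dot": {
                        "type": "pattern_replace",
                        "pattern": "\\.",
                        "replacement": " ",
                    }
                },
            }
        },
        "mappings": {
            "document": {
                "properties": {
                    "filename": {
                        "type": "string",
                        # Assumption: attach the custom analyzer to the main field.
                        "analyzer": "my_analyzer",
                        "fields": {
                            "lower_case_sort": {
                                "type": "string",
                                "analyzer": "case_insensitive_sort",
                            },
                            "raw": {"type": "string", "index": "not_analyzed"},
                        },
                    }
                }
            }
        },
    }


print(json.dumps(build_index_body(), indent=2))
```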

Update

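You have to specify the name of the field you want to search. The following request looks for text in the _all field, which uses the standard analyzer by default.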

http://localhost:9200/defiant/_search?q=text

I think the query below should give you the desired result.

curl -XGET 'http://localhost:9200/twitter/_search?q=filename:text'
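
For reference, the field-scoped search can also be expressed as a request body. The Python sketch below builds both the URI form and a `query_string` body that uses `default_field`; the helper names are hypothetical, and the index name `documents` is taken from the question's mapping rather than from the URLs shown above.

```python
import json
from urllib.parse import urlencode


def uri_search(index, field, term):
    # URI search: q=<field>:<term>, with the query string percent-encoded.
    query = urlencode({"q": "{}:{}".format(field, term)})
    return "http://localhost:9200/{}/_search?{}".format(index, query)


def query_string_body(field, term):
    # Request-body form: default_field limits the search to one field.
    return {"query": {"query_string": {"default_field": field, "query": term}}}


print(uri_search("documents", "filename", "text"))
print(json.dumps(query_string_body("filename", "text")))
```

Either form is expected to match "text.txt" only if the `filename` field itself is analyzed so that the dot splits the name into separate tokens.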

Asked:	9 years, 2 months ago
Viewed:	1730 times
Active:	9 years, 2 months ago