Elasticsearch: "AND" in query_string vs. "default_operator AND"

use*_*357 4 elasticsearch

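Elasticsearch v7.1.1

I don't understand the difference between a query_string that contains explicit "AND" operators and a query_string that relies on "default_operator AND".

I thought both would produce the same result, but they do not:
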
    HTTP PUT http://localhost:9200/umlautsuche

    {
      "settings": {
        "analysis": {
          "char_filter": {
            "my_char_filter": {
              "type": "mapping",
              "mappings": ["ph => f"]
            }
          },
          "filter": {
            "my_ngram": {
              "type": "edge_ngram",
              "min_gram": 3,
              "max_gram": 10
            }
          },
          "analyzer": {
            "my_name_analyzer": {
              "tokenizer": "standard",
              "char_filter": [
                "my_char_filter"
              ],
              "filter": [
                "lowercase",
                "german_normalization"
              ]
            }
          }
        }
      },
      "mappings": {
        "date_detection": false,
        "dynamic_templates": [
          {
            "string_fields_german": {
              "match_mapping_type": "string",
              "match": "*",
              "mapping": {
                "type": "text",
                "analyzer": "my_name_analyzer"
              }
            }
          },
          {
            "dates": {
              "match": "lastModified",
              "match_pattern": "regex",
              "mapping": {
                "type": "date",
                "ignore_malformed": true
              }
            }
          }
        ]
      }
    }

    HTTP POST http://localhost:9200/_bulk

    { "index" : { "_index" : "umlautsuche", "_id" : "1" } }
    {"vorname": "Stephan-Jörg", "nachname": "Müller", "ort": "Hollabrunn"}
    { "index" : { "_index" : "umlautsuche", "_id" : "2" } }
    {"vorname": "Stephan-Joerg", "nachname": "Mueller", "ort": "Hollabrunn"}
    { "index" : { "_index" : "umlautsuche", "_id" : "3" } }
    {"vorname": "Stephan-Jörg", "nachname": "Müll", "ort": "Hollabrunn"}

This query returns no results, which I did not expect:

    HTTP POST http://localhost:9200/umlautsuche/_search

    {
      "query": {
        "query_string": {
          "query": "Stefan Müller Jör*",
          "analyze_wildcard": true,
          "default_operator": "AND",
          "fields": ["vorname", "nachname"]
        }
      }
    }

This query returns the results I expected:

    HTTP POST http://localhost:9200/umlautsuche/_search

    {
      "query": {
        "query_string": {
          "query": "Stefan AND Müller AND Jör*",
          "analyze_wildcard": true,
          "default_operator": "AND",
          "fields": ["vorname", "nachname"]
        }
      }
    }

How can I configure the query or the analyzer so that I don't need these "AND"s between my search terms?

Nik*_*iev 8

What you are running into is ambiguity in how query_string applies boolean operators, and possibly undocumented behavior. Because of this ambiguity, I think it is better to use a bool query with explicit logic, or to use copy_to.

Let me explain in more detail what is happening and how to fix it.

Why does the first query not match?

To see how the query is executed, let's set profile: true

    POST /umlautsuche/_search
    {
        "query": {
            "query_string": {
                "query": "Stefan Müller Jör*",
                "analyze_wildcard": true,
                "default_operator": "AND",
                "fields": [
                    "vorname",
                    "nachname"
                ]
            }
        },
        "profile": true
    }

In the Elasticsearch response we will see:

      "profile": {
        "shards": [
          {
            "id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
            "searches": [
              {
                "query": [
                  {
                    "type": "BooleanQuery",
                    "description": "+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)",
                    "time_in_nanos": 17787641,
                    "breakdown": {
                      "set_min_competitive_score_count": 0,

We are interested in this part:

    "+((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller)) +(nachname:jor* | vorname:jor*)"

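Without going into a deep analysis, we can see that this query wants documents whose last name contains both stefan and muller (or whose first name contains both). That is impossible here, because stefan never appears in a last name in these documents and muller never appears in a first name.

I think what we really want is "find people whose full name matches Stefan Müller Jör*". That is not what the query generated by Elasticsearch does.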
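
To make the mismatch concrete, here is a small Python sketch that evaluates the structure of that Lucene query against the three sample documents. The `analyze` function is only a rough stand-in for `my_name_analyzer` (assumed: `ph => f`, lowercasing, simplified umlaut folding); it is not Elasticsearch's real analysis chain, and all helper names are made up for illustration.

```python
import re

def analyze(text):
    """Rough approximation of my_name_analyzer (an assumption, not the real filter chain)."""
    text = text.replace("ph", "f")             # char_filter: ph => f, applied before tokenizing
    tokens = re.findall(r"\w+", text.lower())  # stand-in for the standard tokenizer + lowercase
    folded = set()
    for tok in tokens:
        for src, dst in (("ä", "a"), ("ö", "o"), ("ü", "u"), ("ß", "ss")):
            tok = tok.replace(src, dst)
        # Simplified german_normalization: ae/oe/ue collapse to a/o/u.
        tok = re.sub(r"(ae|oe|ue)", lambda m: m.group(1)[0], tok)
        folded.add(tok)
    return folded

DOCS = {
    "1": {"vorname": "Stephan-Jörg", "nachname": "Müller"},
    "2": {"vorname": "Stephan-Joerg", "nachname": "Mueller"},
    "3": {"vorname": "Stephan-Jörg", "nachname": "Müll"},
}
FIELDS = ("nachname", "vorname")

def has_term(doc, field, term):
    return term in analyze(doc[field])

def has_prefix(doc, field, prefix):
    return any(tok.startswith(prefix) for tok in analyze(doc[field]))

def first_query(doc):
    # +((+nachname:stefan +nachname:muller) | (+vorname:stefan +vorname:muller))
    both_in_one_field = any(
        has_term(doc, f, "stefan") and has_term(doc, f, "muller") for f in FIELDS
    )
    # +(nachname:jor* | vorname:jor*)
    prefix_anywhere = any(has_prefix(doc, f, "jor") for f in FIELDS)
    return both_in_one_field and prefix_anywhere

matches = [doc_id for doc_id, doc in DOCS.items() if first_query(doc)]
print(matches)  # no document has stefan and muller in the same field
```

Under these assumptions no document satisfies the first clause, which is consistent with the empty result the question reports.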

Why does the second query match?

Let's do the same for the second query with profile: true. The response will contain the following:

      "profile": {
        "shards": [
          {
            "id": "[QCANVs5gR0GOiiGCmEwj7w][umlautsuche][0]",
            "searches": [
              {
                "query": [
                  {
                    "type": "BooleanQuery",
                    "description": "+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)",
                    "time_in_nanos": 17970342,
                    "breakdown": {

We can see that the query is interpreted as follows:

    "+(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller) +(nachname:jor* | vorname:jor*)"

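We can roughly read this as "find people for whom each of the three terms appears in either the first name or the last name", which is what we expected it to do.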
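
The same reading can be sketched in a few lines of Python. The token sets below are written by hand as an assumption about what the analyzer produces for the three sample documents; they were not generated by Elasticsearch, and the function names are illustrative only.

```python
# Tokens each field is assumed to contain after analysis (hand-written assumption).
DOCS = {
    "1": {"vorname": {"stefan", "jorg"}, "nachname": {"muller"}},
    "2": {"vorname": {"stefan", "jorg"}, "nachname": {"muller"}},
    "3": {"vorname": {"stefan", "jorg"}, "nachname": {"mull"}},
}
FIELDS = ("nachname", "vorname")

def clause_matches(doc, clause):
    """A clause is a term, or a prefix if it ends with '*'; any field may satisfy it."""
    for field in FIELDS:
        for token in doc[field]:
            if clause.endswith("*"):
                if token.startswith(clause[:-1]):
                    return True
            elif token == clause:
                return True
    return False

def second_query(doc):
    # +(nachname:stefan | vorname:stefan) +(nachname:muller | vorname:muller)
    # +(nachname:jor* | vorname:jor*)
    return all(clause_matches(doc, c) for c in ("stefan", "muller", "jor*"))

matches = sorted(doc_id for doc_id, doc in DOCS.items() if second_query(doc))
print(matches)  # documents 1 and 2; document 3 has "mull", not "muller"
```

Each term is free to match in a different field, which is why the explicit AND form finds the two documents.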

The documentation for the query_string query says that default_operator: AND should cause whitespace to be interpreted as ANDs:

"The default operator used if no explicit operator is specified. For example, with a default operator of OR, the query capital of Hungary is translated to capital OR of OR Hungary, and with default operator of AND, the same query is translated to capital AND of AND Hungary. The default value is OR."

From what we just saw, though, this does not seem to hold, at least when querying multiple fields.

So what can we do?

Use bool with explicit logic

This query seems to work:

    POST /umlautsuche/_search
    {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "query": "Stefan Müller Jör*",
                            "analyze_wildcard": true,
                            "fields": [
                                "vorname"
                            ]
                        }
                    },
                    {
                        "query_string": {
                            "query": "Stefan Müller Jör*",
                            "analyze_wildcard": true,
                            "fields": [
                                "nachname"
                            ]
                        }
                    }
                ]
            }
        }
    }

This query is not exactly equivalent, so treat it as an example. For instance, suppose we had another record like this, without "Jörg":

    {"vorname": "Stephan", "nachname": "Müll", "ort": "Hollabrunn"}

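The bool query above would still match it even though "Jörg" is missing. To get around this you could write a more complex bool query, but that will not work if you want to avoid parsing the user's input.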
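
As a rough illustration of what such a "more complex bool query" could look like, the sketch below splits the user input on whitespace and gives each term its own must clause across both fields. The helper `build_bool_query` is hypothetical, the split is deliberately naive (it does not escape reserved query_string characters or handle quoted phrases), and the result has not been verified against a cluster; it only shows the amount of input parsing involved.

```python
import json

def build_bool_query(user_input, fields):
    """Hypothetical helper: one must clause per whitespace-separated term."""
    must = [
        {
            "query_string": {
                "query": term,
                "analyze_wildcard": True,
                "fields": list(fields),
            }
        }
        for term in user_input.split()  # naive tokenization of the user input
    ]
    return {"query": {"bool": {"must": must}}}

body = build_bool_query("Stefan Müller Jör*", ["vorname", "nachname"])
print(json.dumps(body, indent=2, ensure_ascii=False))
```

The printed body could be sent to the _search endpoint; the point is that the application, not Elasticsearch, now decides where one term ends and the next begins.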

How can we still use a plain, unparsed query string?

Introduce a copy_to field

We can try the copy_to feature. It copies the contents of several fields into another field, so that they are analyzed and searched together.

We have to modify the mapping (unfortunately, the existing index has to be recreated):

      "mappings": {
        "date_detection": false,
        "dynamic_templates": [
          {
            "name_fields_german": {
              "match_mapping_type": "string",
              "match": "*name",
              "mapping": {
                "type": "text",
                "analyzer": "my_name_analyzer",
                "copy_to": "full_name"
              }
            }
          },
          {
            "string_fields_german": {
              "match_mapping_type": "string",
              "match": "*",
              "mapping": {
                "type": "text",
                "analyzer": "my_name_analyzer"
              }
            }
          },
          {
            "dates": {
              "match": "lastModified",
              "match_pattern": "regex",
              "mapping": {
                "type": "date",
                "ignore_malformed": true
              }
            }
          }
        ]
      }

Then we can populate the index exactly as before.

Now we can query the new full_name field with the following query:

    POST /umlautsuche/_search
    {
        "query": {
            "bool": {
                "must": [
                    {
                        "query_string": {
                            "query": "Stefan Müller Jör*",
                            "analyze_wildcard": true,
                            "default_operator": "AND",
                            "fields": [
                                "full_name"
                            ]
                        }
                    }
                ]
            }
        }
    }

This query returns the same 2 documents as the second query. So in this case default_operator: AND behaves as we expect, requiring every token in the query to match.
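
Why a single combined field removes the ambiguity can be sketched in Python as well. As before, the token sets are a hand-written assumption about the analyzer output, and modeling copy_to as a set union is a simplification for illustration, not how Elasticsearch stores the field.

```python
# Analyzed tokens per field, written by hand as an assumption.
DOCS = {
    "1": {"vorname": {"stefan", "jorg"}, "nachname": {"muller"}},
    "2": {"vorname": {"stefan", "jorg"}, "nachname": {"muller"}},
    "3": {"vorname": {"stefan", "jorg"}, "nachname": {"mull"}},
}

def full_name_tokens(doc):
    # copy_to feeds both source fields into one field; modeled as one token set.
    return doc["vorname"] | doc["nachname"]

def clause_matches(tokens, clause):
    # A trailing '*' means prefix match; otherwise the token must be equal.
    if clause.endswith("*"):
        return any(tok.startswith(clause[:-1]) for tok in tokens)
    return clause in tokens

def search_full_name(clauses):
    # default_operator AND on a single field: every clause must match that field.
    return sorted(
        doc_id
        for doc_id, doc in DOCS.items()
        if all(clause_matches(full_name_tokens(doc), c) for c in clauses)
    )

hits = search_full_name(["stefan", "muller", "jor*"])
print(hits)  # documents 1 and 2
```

With only one field in play, "all terms in the same field" and "all terms somewhere in the name" mean the same thing, so the implicit AND gives the expected hits.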


Hope that helps!