How to percolate simple_query_string / query_string queries

Rrr*_*Rrr 6 elasticsearch elasticsearch-percolate

Index:

{
    "settings": {
        "index.percolator.map_unmapped_fields_as_text": true
    },
    "mappings": {
        "properties": {
            "query": {
                "type": "percolator"
            }
        }
    }
}

This test percolator query works:

{
    "query": {
        "match": {
            "message": "blah"
        }
    }
}

This query does not work:

{
    "query": {
        "simple_query_string": {
            "query": "bl*"
        }
    }
}

Results:

{"took":15,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.13076457,"hits":[{"_index":"my-index","_type":"_doc","_id":"1","_score":0.13076457,"_source":{"query":{"match":{"message":"blah"}}},"fields":{"_percolator_document_slot":[0]}}]}}

Why does this simple_query_string query not match the document?

And*_*fan 3

I don't quite understand what you are asking either. Perhaps you don't fully understand the percolator? Here is an example I have just tried.

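Let's assume you have an index (let's call it test) into which you want to index some documents. This index has the following mapping (just a random test index from my test setup):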

{  
    "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
    "mappings": {
        "properties": {
            "code": {
                "type": "long"
            },
            "date": {
                "type": "date"
            },
            "part": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "val": {
                "type": "long"
            },
            "email": {
              "type": "text",
              "analyzer": "email"
            }
        }
    }
}

Notice that it has a custom email analyzer that splits something like foo@bar.com into the following tokens: foo@bar.com, foo, bar.com, bar, com.
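To make the token list above easier to check, here is a rough Python sketch that emulates the email filter chain (pattern_capture with preserve_original, then lowercase, then unique). It is only an approximation and not how Lucene implements the filter: the function name email_tokens is made up for illustration, Python's re module has no \p{L} so [^\W\d_] stands in for "any letter", and the input is assumed to be a single token as the uax_url_email tokenizer would emit for an email address.

```python
import re

# Approximation of the five patterns in the "email" filter above.
# Python's `re` has no \p{L}, so [^\W\d_] stands in for "any letter".
PATTERNS = [
    r"([^@]+)",
    r"([^\W\d_]+)",
    r"(\d+)",
    r"@(.+)",
    r"([^-@]+)",
]


def email_tokens(token):
    """Emulate pattern_capture (preserve_original) + lowercase + unique.

    `token` is assumed to be one token already produced by the tokenizer.
    """
    tokens = [token]  # preserve_original keeps the whole input as a token
    for pattern in PATTERNS:
        for match in re.finditer(pattern, token):
            tokens.append(match.group(1))

    # lowercase, then drop duplicates while keeping first-seen order
    seen, unique = set(), []
    for candidate in (t.lower() for t in tokens):
        if candidate and candidate not in seen:
            seen.add(candidate)
            unique.append(candidate)
    return unique


if __name__ == "__main__":
    print(email_tokens("foo@bar.com"))
    print(email_tokens("foo"))
```

On a real cluster, the _analyze API with this analyzer is the authoritative way to see the tokens; the sketch only mirrors the idea.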

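As the documentation states, you can create a separate percolator index that holds only your percolator queries and not the documents themselves. Even though the percolator index does not contain the documents, it should still hold the mapping of the index that does hold them (test in our case).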

This is the mapping of the percolator index (which I called percolator_index). It also has the special analyzer used to split the email field:

{  
    "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
    "mappings": {
        "properties": {
            "query": {
                "type": "percolator"
            },
            "code": {
                "type": "long"
            },
            "date": {
                "type": "date"
            },
            "part": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword",
                        "ignore_above": 256
                    }
                }
            },
            "val": {
                "type": "long"
            },
            "email": {
              "type": "text",
              "analyzer": "email"
            }
        }
    }
}

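It has almost the same mapping and settings as my original index; the only difference is the additional query field of type percolator added to the mapping.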
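Since the percolator index is just the original index definition plus one extra field, the body can be derived rather than copied by hand. The following is a minimal sketch under that assumption; percolator_index_body is a hypothetical helper name, and test_index is a shortened stand-in for the full test index definition shown earlier, not the complete settings.

```python
import copy


def percolator_index_body(source_index_body, query_field="query"):
    """Return a copy of an index definition with a percolator field added."""
    body = copy.deepcopy(source_index_body)  # leave the original untouched
    properties = body.setdefault("mappings", {}).setdefault("properties", {})
    if query_field in properties:
        raise ValueError(f"field {query_field!r} already exists in the mapping")
    properties[query_field] = {"type": "percolator"}
    return body


# Shortened stand-in for the `test` index definition (assumption: trimmed
# to two fields and a bare analyzer entry to keep the example small).
test_index = {
    "settings": {
        "analysis": {"analyzer": {"email": {"tokenizer": "uax_url_email"}}}
    },
    "mappings": {
        "properties": {
            "part": {"type": "text"},
            "email": {"type": "text", "analyzer": "email"},
        }
    },
}

if __name__ == "__main__":
    body = percolator_index_body(test_index)
    print(sorted(body["mappings"]["properties"]))
```

The resulting dictionary would be used as the request body when creating percolator_index; the settings are carried over unchanged so that the same analyzers apply to percolated documents.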

The query you are interested in, the simple_query_string, should be put inside a document in percolator_index. Like this:

PUT /percolator_index/_doc/1?refresh
{
    "query": {
        "simple_query_string" : {
            "query" : "month foo@bar.com",
            "fields": ["part", "email"]
        }
    }
}

To make it more interesting, I added email to the fields that the query searches specifically (by default, all fields are searched).

Now the aim is to test a document that should eventually go into the test index against this simple_query_string query from the percolator index. For example:

GET /percolator_index/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "date":"2004-07-31T11:57:52.000Z","part":"month","code":109,"val":0,"email":"foo@bar.com"
      }
    }
  }
}

Obviously, what is under document is your future (not yet existing) document. It is matched against the simple_query_string defined above and produces a match:

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.39324823,
        "hits": [
            {
                "_index": "percolator_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.39324823,
                "_source": {
                    "query": {
                        "simple_query_string": {
                            "query": "month foo@bar.com",
                            "fields": [
                                "part",
                                "email"
                            ]
                        }
                    }
                },
                "fields": {
                    "_percolator_document_slot": [
                        0
                    ]
                }
            }
        ]
    }
}
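The percolate request used for this match can also be assembled programmatically. This is only a sketch of building the JSON body; percolate_body is a made-up helper name, and actually sending the body to GET /percolator_index/_search (with whatever HTTP or Elasticsearch client you use) is deliberately left out.

```python
import json


def percolate_body(document, field="query"):
    """Build a search body that percolates one in-memory document."""
    return {
        "query": {
            "percolate": {
                "field": field,        # the percolator field in the mapping
                "document": document,  # the not-yet-indexed document
            }
        }
    }


# The candidate document from the example above.
candidate = {
    "date": "2004-07-31T11:57:52.000Z",
    "part": "month",
    "code": 109,
    "val": 0,
    "email": "foo@bar.com",
}

if __name__ == "__main__":
    print(json.dumps(percolate_body(candidate), indent=2))
```

Swapping the email value for "foo" yields a body equivalent to the second request in this answer.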

What if I percolate this document instead:

{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "date":"2004-07-31T11:57:52.000Z","part":"month","code":109,"val":0,"email":"foo"
      }
    }
  }
}

(Notice that the email is just foo.) This is the result:

{
    "hits": {
        "total": {
            "value": 1,
            "relation": "eq"
        },
        "max_score": 0.26152915,
        "hits": [
            {
                "_index": "percolator_index",
                "_type": "_doc",
                "_id": "1",
                "_score": 0.26152915,
                "_source": {
                    "query": {
                        "simple_query_string": {
                            "query": "month foo@bar.com",
                            "fields": [
                                "part",
                                "email"
                            ]
                        }
                    }
                },
                "fields": {
                    "_percolator_document_slot": [
                        0
                    ]
                }
            }
        ]
    }
}

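Notice that the score is a bit lower than for the first percolated document. This is probably because foo (my email) matches only one of the terms produced by analyzing foo@bar.com, whereas foo@bar.com would match all of them (thus giving a better score).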

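I am not sure which analyzer you were talking about, though. I think the example above covers the only "analyzer" issue/unknown that might be a bit confusing.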

  • FYI, I created https://github.com/elastic/elasticsearch/issues/48874, because either there is an undocumented issue or this is a bug in itself. (2 upvotes)