Exact document match with ElasticSearch

vad*_*vad 2 lucene elasticsearch

I need to query a set of "short documents" for exact matches. Example:

Documents:

  1. {"name":"John Doe","alt":"John W Doe"}
  2. {"name":"My friend John Doe","alt":"John A Doe"}
  3. {"name":"John","alt":"Susy"}
  4. {"name":"Jack","alt":"John Doe"}

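Expected results:

  1. If I search for "John Doe", I would like document 1 to score much higher than documents 2 and 4
  2. If I search for "John Doé", the same as above
  3. If I search for "John", I would like to get document 3 (a full match is better than a repetition across name and alt)

Is this possible with ES? How can I achieve it? I tried boosting "name", but I can't find how to match a document field exactly, rather than searching within it.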

DrT*_*ech 5

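What you are describing is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks for documents that contain the term, then assigns a _score to each document based on: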

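  • how common the term is across all documents (more common == less relevant)
  • how common the term is within the document's field (more common == more relevant)
  • how long the document's field is (longer == less relevant)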
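
The three factors above can be made concrete with a small sketch. The following Python is a simplified illustration modelled on Lucene's classic TF/IDF similarity, not the exact formula Elasticsearch evaluates (query normalisation and the lossy storage of field norms are left out). The whitespace tokenisation and the helper names `tokenize` and `score` are assumptions made for the example.

```python
import math

# The "name" field of the four example documents.
DOCS = {
    1: "John Doe",
    2: "My friend John Doe",
    3: "John",
    4: "Jack",
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()


def score(query, docs):
    """Score each doc with a simplified TF/IDF: sum of tf * idf^2 * field_norm."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    num_docs = len(docs)
    scores = {}
    for doc_id, terms in tokens.items():
        field_norm = 1.0 / math.sqrt(len(terms))  # longer field == less relevant
        total = 0.0
        for term in tokenize(query):
            freq = terms.count(term)
            if freq == 0:
                continue
            doc_freq = sum(1 for t in tokens.values() if term in t)
            idf = 1.0 + math.log(num_docs / (doc_freq + 1))  # rarer term == more relevant
            tf = math.sqrt(freq)  # more occurrences in the field == more relevant
            total += tf * idf * idf * field_norm
        if total > 0:
            scores[doc_id] = total
    # Highest score first, as in a search response.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for query in ("john doe", "john"):
        print(query, [(doc_id, round(s, 3)) for doc_id, s in score(query, DOCS)])
```

With these inputs the sketch ranks document 1 first for "john doe" and document 3 first for "john". The absolute numbers are not expected to match the scores Elasticsearch reports.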

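The reason you are not seeing clear results is that Elasticsearch is distributed and you are testing with a small amount of data. By default an index has 5 primary shards, and your docs are being indexed on different shards. Each shard keeps its own doc frequency counts, so the scores are distorted.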

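When you add real-world amounts of data, the frequencies even out across shards, but for testing with small amounts of data you need to do one of two things: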

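  1. create an index with only one primary shard, or
  2. specify search_type=dfs_query_then_fetch, which first fetches the frequencies from each shard and then runs the query using the global frequencies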
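
To see why a handful of documents spread over several shards gives skewed scores, here is a minimal Python sketch comparing the inverse document frequency of "doe" computed per shard with the value computed from global counts. The shard layout is hypothetical (the real placement depends on routing), and the `idf` formula is the classic Lucene one, assumed here purely for illustration.

```python
import math

# Hypothetical placement of the four documents' "name" terms on three shards.
SHARDS = {
    "shard_a": [["john", "doe"], ["john"]],        # docs 1 and 3
    "shard_b": [["my", "friend", "john", "doe"]],  # doc 2
    "shard_c": [["jack"]],                         # doc 4
}


def idf(term, docs):
    """Classic Lucene-style IDF over the given set of documents."""
    doc_freq = sum(1 for terms in docs if term in terms)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


def local_idfs(term, shards):
    """IDF as each shard would compute it from its own counts only."""
    return {name: idf(term, docs) for name, docs in shards.items()}


def global_idf(term, shards):
    """IDF computed from counts gathered across every shard first."""
    all_docs = [terms for docs in shards.values() for terms in docs]
    return idf(term, all_docs)


if __name__ == "__main__":
    print("per shard:", local_idfs("doe", SHARDS))
    print("global:   ", global_idf("doe", SHARDS))
```

In this layout the same term gets a different weight depending on which shard a document happens to live on. Gathering the frequencies first, which is what dfs_query_then_fetch does, gives every document the same global weight.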

To demonstrate, first index your data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "alt" : "John W Doe",
   "name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
{
   "alt" : "John A Doe",
   "name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "alt" : "Susy",
   "name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "alt" : "John Doe",
   "name" : "Jack"
}
'

Now search for "john doe", remembering to specify dfs_query_then_fetch:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john doe"
      }
   }
}
'

Doc 1 is first in the results:

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 1.0189849,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.81518793,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 0.3066778,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1.0189849,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 8
# }

When you search for "john":

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john"
      }
   }
}
'

Doc 3 appears first:

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 1,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 0.625,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.5,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

Ignoring accents

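The second issue is matching "John Doé". This is a matter of analysis. To make full text more searchable, we analyse it into separate terms, or tokens, and these are what is stored in the index. To match, for example, john, John and JOHN when the user searches for john, each term/token is passed through a number of token filters that put it into a standard form.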

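When we do a full-text search, the search terms go through exactly the same process. So if we have a document which contains John, it is indexed as john, and if the user searches for JOHN, we actually search for john.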
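
As a rough illustration of analysing text the same way at index time and at search time, the Python sketch below builds a toy inverted index. The `analyze`, `build_index` and `search` helpers are hypothetical names for this example, and the whitespace-plus-lowercase analysis is only a stand-in for what a real analyzer does.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: split on whitespace and lowercase each token."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each analyzed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Analyze the query the same way, then look up each resulting term."""
    hits = set()
    for term in analyze(query):
        hits |= index.get(term, set())
    return hits


if __name__ == "__main__":
    index = build_index({1: "John Doe", 4: "Jack"})
    print(sorted(index))          # the normalised terms stored in the index
    print(search(index, "JOHN"))  # the query is lowercased before lookup
```

Because both sides pass through `analyze`, the query JOHN and the indexed text John meet at the same term, john.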

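To make Doé match doe, we need a token filter which removes accents, and we need to apply it both to the text being indexed and to the search terms. The easiest way to do this is to use the ASCII folding token filter.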
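
The effect of accent folding can be approximated outside Elasticsearch. This Python sketch is an assumption-laden stand-in for the asciifolding filter: it strips accents through Unicode decomposition, whereas the real filter uses its own mapping table, so results can differ for some characters. The function names are made up for the example.

```python
import re
import unicodedata


def fold_accents(token):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Tokenize on word characters, lowercase, then fold accents."""
    return [fold_accents(token.lower()) for token in re.findall(r"\w+", text)]


if __name__ == "__main__":
    print(analyze("John Doé"))
    # Indexed text and search terms end up as the same tokens.
    print(analyze("John Doé") == analyze("john doe"))
```

Applying the same function to the indexed text and to the search terms is what lets a query for Doé find a document containing Doe, and the other way round.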

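We can define a custom analyzer when we create an index, and we can specify in the mapping that a particular field should use that analyzer both at index time and at search time.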

First, delete the old index:

curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1' 

Then create the index, specifying the custom analyzer and the mapping:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_accents" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "asciifolding"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   },
   "mappings" : {
      "test" : {
         "properties" : {
            "name" : {
               "type" : "string",
               "analyzer" : "no_accents"
            }
         }
      }
   }
}
'

Reindex the data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "alt" : "John W Doe",
   "name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
{
   "alt" : "John A Doe",
   "name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "alt" : "Susy",
   "name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "alt" : "John Doe",
   "name" : "Jack"
}
'

Now, test the search:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john doé"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 1.0189849,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.81518793,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 0.3066778,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1.0189849,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 6
# }