Field not sorted alphabetically in Elasticsearch

ste*_*hns 3 elasticsearch elasticsearch-mapping

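I have some documents with a name field. I use the analyzed version of the name field for searching and the not_analyzed version for sorting. The sort works at one level: names are first grouped by their initial letter. Within each letter's list, however, the names are sorted lexicographically rather than alphabetically. Here is the mapping I use: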

{
  "mappings": {
    "seing": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Can anyone offer a solution for this?

Eva*_*kas 13

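Digging into the Elasticsearch documentation, I came across this:

Case-Insensitive Sorting

Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in "String Sorting and Multifields" of using a not_analyzed field for sorting: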

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": {                    //1
          "type": "string",
          "fields": {
            "raw": {                 //2
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
  1. The analyzed name field is used for searching.
  2. The not_analyzed name.raw field is used for sorting.

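A search request sorted on name.raw returns the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order, as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, so the names are sorted with the lowest bytes first.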
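
The difference between the two orderings can be reproduced outside Elasticsearch. The sketch below is plain Python, not Elasticsearch code. It assumes that Python's default string comparison (by code point) is an acceptable stand-in for the byte-wise ordering of a not_analyzed field, which holds for these ASCII names.

```python
names = ["Boffey", "BROWN", "bailey"]

# Default comparison is by code point: uppercase ASCII letters (65-90)
# sort before lowercase ones (97-122).
lexicographic = sorted(names)

# Comparing lowercased copies gives the order a human reader expects.
alphabetical = sorted(names, key=str.lower)

print(lexicographic)  # ['BROWN', 'Boffey', 'bailey']
print(alphabetical)   # ['bailey', 'Boffey', 'BROWN']
```

Lowercasing the sort key is the same idea the custom analyzer applies at index time.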

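That may make sense to a computer, but it doesn't make much sense to human beings, who would reasonably expect these names to be sorted alphabetically regardless of case. To achieve this, we need to index each name in a way that makes the byte ordering correspond to the sort order that we want.

In other words, we need an analyzer that emits a single lowercase token:

Following this logic, instead of storing the raw value you need to lowercase it with a custom keyword analyzer: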

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "case_insensitive_sort" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "seing" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "fields" : {
            "raw" : {
              "type" : "string",
              "analyzer" : "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Sorting on name.raw should now be alphabetical rather than lexicographic.

A quick test on my local machine using Marvel:

Index structure:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "keyword": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /my_index/user/1
{
  "name": "Tim"
}

PUT /my_index/user/2
{
  "name": "TOM"
}

Query using the raw field:

POST /my_index/user/_search
{
  "sort": "name.raw"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "TOM"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "Tim"
  ]
}

Query using the lowercased string:

POST /my_index/user/_search
{
  "sort": "name.keyword"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "tim"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "tom"
  ]
}

I suspect the second result is the one you want.
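
To see why the two queries disagree, the sort values can be modelled in a few lines of Python. This is only an illustration: `keyword_lowercase` is a hypothetical helper that imitates the keyword tokenizer plus lowercase filter, and it is not part of any Elasticsearch client.

```python
docs = {"1": "Tim", "2": "TOM"}


def keyword_lowercase(value):
    """Imitate the case_insensitive_sort analyzer: keep the whole value
    as a single token (keyword tokenizer) and lowercase it."""
    return value.lower()


# name.raw: sort on the value exactly as it was indexed.
by_raw = sorted(docs, key=lambda doc_id: docs[doc_id])

# name.keyword: sort on the single lowercased token.
by_keyword = sorted(docs, key=lambda doc_id: keyword_lowercase(docs[doc_id]))

print(by_raw)      # ['2', '1']  -> TOM, Tim
print(by_keyword)  # ['1', '2']  -> Tim, TOM
```

The lowercased tokens play the same role as the "sort" arrays in the second result set.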


Piw*_*wEL 7

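Since Elasticsearch 5.2, you can use a normalizer to set up case-insensitive sorting.

The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token.

The normalizer is applied prior to indexing the keyword, as well as at search time when the keyword field is searched via a query parser such as the match query.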

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/type/1
{
  "foo": "BÀR"
}

PUT index/type/2
{
  "foo": "bar"
}

PUT index/type/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

The above query matches documents 1 and 2, since BÀR is converted to bar at both index and query time.

{
  "took": $body.took,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_type": "type",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "foo": "bar"
        }
      },
      {
        "_index": "index",
        "_type": "type",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "foo": "BÀR"
        }
      }
    ]
  }
}

Also, the fact that keywords are converted prior to indexing means that aggregations return normalized values:

GET index/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "foo"
      }
    }
  }
}

returns

{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "foo_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2
        },
        {
          "key": "baz",
          "doc_count": 1
        }
      ]
    }
  }
}
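
The effect of my_normalizer on matching and on terms aggregations can be approximated in Python. This is a rough sketch under stated assumptions: `normalize` is a hypothetical helper, and stripping combining marks after Unicode decomposition only approximates the asciifolding filter, which folds more characters than this does.

```python
import unicodedata
from collections import Counter


def normalize(value):
    """Approximate the lowercase + asciifolding filters."""
    lowered = value.lower()
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Drop combining marks such as the grave accent in "À".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


docs = {"1": "BÀR", "2": "bar", "3": "baz"}

# Index time: each keyword is stored in its normalized form.
indexed = {doc_id: normalize(value) for doc_id, value in docs.items()}

# Query time: the match query's input is normalized the same way.
query = normalize("BAR")
matches = sorted(doc_id for doc_id, term in indexed.items() if term == query)

# A terms aggregation counts the normalized values, not the originals.
buckets = Counter(indexed.values())

print(matches)        # ['1', '2']
print(dict(buckets))  # {'bar': 2, 'baz': 1}
```

Because the stored terms are already lowercased and folded, sorting on the field compares those normalized values, which is what makes the sort case-insensitive.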

Source: Normalizer