I have the following index mapping (simplified):
{
  "documents": {
    "mappings": {
      "document": {
        "properties": {
          "filename": {
            "type": "string",
            "fields": {
              "lower_case_sort": {
                "type": "string",
                "analyzer": "case_insensitive_sort"
              },
              "raw": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          }
        }
      }
    }
  }
}
I put two documents into this index:
{
  "_index": "documents",
  "_type": "document",
  "_id": "777",
  "_source": {
    "filename": "text.txt"
  }
}
...
{
  "_index": "documents",
  "_type": "document",
  "_id": "888",
  "_source": {
    "filename": "text 123.txt"
  }
}
When I run a query_string or simple_query_string query for "text", I expect both documents to be returned. They should match because the filenames are "text.txt" and "text 123.txt".
http://localhost:9200/defiant/_search?q=text
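However, I only get the document named "text 123.txt". "text.txt" is found only when I search for "text.*", "text.txt", or "text.???"; I have to include the dot from the filename in the query.
This is the explain result for document id 777 (text.txt):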
curl -XGET 'http://localhost:9200/documents/document/777/_explain' -d '{"query": {"query_string" : {"query" : "text"}}}'
-->
{
  "_index": "documents",
  "_type": "document",
  "_id": "777",
  "matched": false,
  "explanation": {
    "value": 0.0,
    "description": "Failure to meet condition(s) of required/prohibited clause(s)",
    "details": [{
      "value": 0.0,
      "description": "no match on required clause (_all:text)",
      "details": [{
        "value": 0.0,
        "description": "no matching term",
        "details": []
      }]
    }, {
      "value": 0.0,
      "description": "match on required clause, product of:",
      "details": [{
        "value": 0.0,
        "description": "# clause",
        "details": []
      }, {
        "value": 0.47650534,
        "description": "_type:document, product of:",
        "details": [{
          "value": 1.0,
          "description": "boost",
          "details": []
        }, {
          "value": 0.47650534,
          "description": "queryNorm",
          "details": []
        }]
      }]
    }]
  }
}
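Did I mess up the mapping? I would have expected the "." to be treated as a term separator when the document is analyzed at index time...
Edit: settings for case_insensitive_sort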
{
  "documents": {
    "settings": {
      "index": {
        "creation_date": "1473169458336",
        "analysis": {
          "analyzer": {
            "case_insensitive_sort": {
              "filter": [
                "lowercase"
              ],
              "tokenizer": "keyword"
            }
          }
        }
      }
    }
  }
}
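This is the expected behavior of the standard analyzer (the default analyzer): it uses the standard tokenizer, and under the algorithm that tokenizer follows (Unicode text segmentation), a dot between letters is not treated as a separator character.
You can verify this with the analyze API: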
curl -XGET 'localhost:9200/_analyze' -d '
{
"analyzer" : "standard",
"text" : "test.txt"
}'
This produces only a single token:
{
  "tokens": [
    {
      "token": "test.txt",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
You can use a pattern replace character filter to replace the dots with whitespace:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "replace_dot"
          ]
        }
      },
      "char_filter": {
        "replace_dot": {
          "type": "pattern_replace",
          "pattern": "\\.",
          "replacement": " "
        }
      }
    }
  }
}
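You have to reindex your documents; after that you should get the desired results. The analyze API is very handy for checking how documents are stored in the inverted index.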
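To see offline what the replace_dot char filter changes, here is a minimal Python sketch. It is not Elasticsearch code: the function names are hypothetical, and the regular expression is only a rough stand-in for the standard tokenizer (it keeps a dot inside a token when the dot sits between two letters or between two digits). The analyze API against a real index remains the authoritative check.

```python
import re
from typing import List

# Rough approximation of the standard tokenizer (an assumption for
# illustration): a token is a run of letters/digits, and a dot stays inside
# the token only when it sits between two letters or between two digits.
TOKEN_RE = re.compile(
    r"(?:[A-Za-z0-9]|(?<=[A-Za-z])\.(?=[A-Za-z])|(?<=[0-9])\.(?=[0-9]))+"
)


def replace_dot(text: str) -> str:
    """Mimic the replace_dot char filter: turn every literal dot into a space."""
    return re.sub(r"\.", " ", text)


def analyze(text: str, use_char_filter: bool) -> List[str]:
    """Apply the optional char filter, then the simplified tokenizer."""
    if use_char_filter:
        text = replace_dot(text)
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    for filename in ("text.txt", "text 123.txt"):
        print(filename, analyze(filename, False), analyze(filename, True))
```

In this sketch "text.txt" stays a single token without the char filter, so a search for "text" has nothing to match; with the filter it becomes the two tokens "text" and "txt".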
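Update
You have to specify the name of the field you want to search. The following request looks for "text" in the _all field, which uses the standard analyzer by default.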
http://localhost:9200/defiant/_search?q=text
I think the query below should give you the desired results.
curl -XGET 'http://localhost:9200/twitter/_search?q=filename:text'
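As a hedged illustration of restricting the search to one field, the following Python sketch builds both the URI form (q=field:term) and a request body using the query_string query's default_field parameter. The helper names are hypothetical, and the host and the index name "documents" are assumptions taken from the mapping in the question; nothing is sent to a server.

```python
import json
from urllib.parse import urlencode


def uri_search(base_url: str, index: str, field: str, term: str) -> str:
    """Build a URI search that restricts the term to a single field."""
    return "{}/{}/_search?{}".format(
        base_url, index, urlencode({"q": "{}:{}".format(field, term)})
    )


def body_search(field: str, term: str) -> dict:
    """Build a query_string request body that searches `field` instead of _all."""
    return {"query": {"query_string": {"default_field": field, "query": term}}}


if __name__ == "__main__":
    # Host and index name are assumptions for this example.
    print(uri_search("http://localhost:9200", "documents", "filename", "text"))
    print(json.dumps(body_search("filename", "text"), indent=2))
```

Either form only returns "text.txt" if the filename field itself is analyzed so that "text" becomes its own token, for example with the replace_dot char filter described earlier.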