Elasticsearch more_like_this query

Question

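I'm trying to wrap my head around how the more_like_this query works, and I seem to be missing something. I read the documentation, but the ES documentation is often somewhat... lacking.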

The goal is to be able to limit results by term frequency, as attempted here.

So I set up a simple index, with term vectors enabled for debugging, and then added two simple documents.

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "doc": {
         "properties": {
            "text": {
               "type": "string",
               "term_vector": "yes"
            }
         }
      }
   }
}

PUT /test_index/doc/1
{
    "text": "apple, apple, apple, apple, apple"
}

PUT /test_index/doc/2
{
    "text": "apple, apple"
}

When I look at the term vectors, I see what I expect:

GET /test_index/doc/1/_termvector
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 5
            }
         }
      }
   }
}

GET /test_index/doc/2/_termvector
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "2",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 2
            }
         }
      }
   }
}

When I run the following query with "min_term_freq": 1, I get both documents back:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 1,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.5816214,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.5816214,
            "_source": {
               "text": "apple, apple, apple, apple, apple"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.5254995,
            "_source": {
               "text": "apple, apple"
            }
         }
      ]
   }
}

However, if I increase "min_term_freq" to 2 (or more), I get nothing back, although I would expect both documents to be returned:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

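Why? What am I missing?

If I wanted to set up a query that returns only the document with 5 occurrences of "apple", and not the one with 2 occurrences, is there a better way?

Here is the code, for convenience: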

http://sense.qbox.io/gist/341f9f77a6bd081debdcaa9e367f5a39be9359cc

Answer 1

Vin*_*han 8

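The minimum term frequency and minimum document frequency are actually applied to the input before the MLT is performed. This means that, since "apple" occurs only once in your input text, it never qualifies for MLT when the minimum term frequency is set to 2. If you change your input to "apple apple", it works: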

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}

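The same goes for the minimum document frequency. "apple" is present in at least 2 documents, so a min_doc_freq of up to 2 will still qualify "apple" from the input text for the MLT operation.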
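The min_term_freq behaviour described above can be illustrated with a small Python sketch. It is only a simplified model of the idea, not the actual Elasticsearch or Lucene implementation: the function name `select_query_terms` and the regex tokenizer are assumptions made for this example.

```python
import re
from collections import Counter


def select_query_terms(like_text, min_term_freq=2):
    """Keep only the input terms that occur at least min_term_freq times
    in the input text itself (hypothetical helper, simplified tokenizer)."""
    tokens = re.findall(r"\w+", like_text.lower())
    counts = Counter(tokens)
    return sorted(term for term, n in counts.items() if n >= min_term_freq)


# "apple" occurs once in the input, so it survives only with min_term_freq=1.
print(select_query_terms("apple", min_term_freq=1))
print(select_query_terms("apple", min_term_freq=2))
# Repeating the term in the input makes it qualify again.
print(select_query_terms("apple apple", min_term_freq=2))
```

In this model the second call returns an empty list, which mirrors why the query in the question matched nothing: no term was left to search for, regardless of how often "apple" appears in the indexed documents.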

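I don't think you can use MLT for this. Both the min term frequency and min doc frequency constraints are actually applied to the input text, not to the documents being compared. Another approach is to use the scripting plugin and implement this on the filter script side - http://stackoverflow.com/questions/28296320/elasticsearch-filter-via-number-of-mentions/28312561#28312561 (2 upvotes)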
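As a rough illustration of filtering on the document side rather than on the input side, the sketch below works client-side on term-vector responses shaped like the ones in the question. It is not the script filter from the linked answer; the helper names `term_freq` and `ids_with_min_term_freq` are hypothetical, and the sample responses are trimmed copies of the `_termvector` output shown above.

```python
def term_freq(termvector_response, field, term):
    """Read term_freq for a term from a _termvector-style response dict.
    Returns 0 if the field or term is absent."""
    terms = (
        termvector_response.get("term_vectors", {})
        .get(field, {})
        .get("terms", {})
    )
    return terms.get(term, {}).get("term_freq", 0)


def ids_with_min_term_freq(responses, field, term, minimum):
    """Return the _id of every response whose term frequency is >= minimum."""
    return [r["_id"] for r in responses if term_freq(r, field, term) >= minimum]


# Trimmed copies of the two term-vector responses from the question.
responses = [
    {"_id": "1", "term_vectors": {"text": {"terms": {"apple": {"term_freq": 5}}}}},
    {"_id": "2", "term_vectors": {"text": {"terms": {"apple": {"term_freq": 2}}}}},
]

print(ids_with_min_term_freq(responses, "text", "apple", 5))
print(ids_with_min_term_freq(responses, "text", "apple", 2))
```

Fetching term vectors per document does not scale to large result sets, so this is only meant to show what "apple occurs at least 5 times in the document" means in terms of the data above.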

Answer 2

Fil*_*vic 6

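Like the poster of this question, I was also trying to wrap my head around the more_like_this query...

I looked for good sources of information online, but (as in most cases) the documentation turned out to be the most helpful. So here is the link to the documentation, along with some of the more important terms (and/or ones that are a bit harder to understand, for which I added my own interpretation):

max_query_terms - The maximum number of query terms that will be selected (from each input document). Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25.

min_term_freq - The minimum term frequency below which terms will be ignored from the input document. Defaults to 2.

If a term appears in the input document fewer than 2 (default) times, it is ignored from the input document, i.e. it will not be searched for in other possible more_like_this documents.

min_doc_freq - The minimum document frequency below which terms will be ignored from the input document. Defaults to 5.

This one took me a second to understand, so here is my interpretation:

It is the number of documents in which a term from the input document must appear in order to be selected as a query term.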
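The min_doc_freq interpretation above can be checked against the two documents from the question with a short Python sketch. This is a simplified model under stated assumptions, not the real implementation: `document_frequency` and `passes_min_doc_freq` are hypothetical helpers, and the regex tokenizer only stands in for a real analyzer.

```python
import re


def document_frequency(term, documents):
    """Count how many documents contain the term at least once."""
    return sum(
        1 for text in documents
        if term in set(re.findall(r"\w+", text.lower()))
    )


def passes_min_doc_freq(term, documents, min_doc_freq=5):
    """True if the term occurs in at least min_doc_freq documents."""
    return document_frequency(term, documents) >= min_doc_freq


# The two documents indexed in the question.
docs = ["apple, apple, apple, apple, apple", "apple, apple"]

print(document_frequency("apple", docs))
print(passes_min_doc_freq("apple", docs, min_doc_freq=2))
print(passes_min_doc_freq("apple", docs))  # uses the default of 5
```

Note that how often the term occurs inside each document does not matter here; only the number of documents containing it counts, which is why the question sets "min_doc_freq": 1 for such a small index.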

That's it. I hope I saved someone a few minutes of their life. :)

Cheers!
