Efficient way to retrieve all _ids in ElasticSearch

Question

Mah*_*oni 60 elasticsearch

What is the fastest way to get all _ids of a certain index from ElasticSearch? Is it possible with a simple query? One of my indices has around 20,000 documents.

Answer 1

Tho*_*ten 62

Edit: please read @Aleck Landgraf's answer, too.

Do you just want the elasticsearch-internal _id field? Or an id field from within your documents?

For the former, try

curl http://localhost:9200/index/type/_search?pretty=true -d '
{ 
    "query" : { 
        "match_all" : {} 
    },
    "stored_fields": []
}
'

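Note, 2017 update: this post originally included "fields": [], but the name has since changed and stored_fields is the new value.

The result will contain only the "metadata" of your documents: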

{
  "took" : 7,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "index",
      "_type" : "type",
      "_id" : "36",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "38",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "39",
      "_score" : 1.0
    }, {
      "_index" : "index",
      "_type" : "type",
      "_id" : "34",
      "_score" : 1.0
    } ]
  }
}

For the latter, if you want to include a field from your document, simply add it to the fields array:

curl http://localhost:9200/index/type/_search?pretty=true -d '
{ 
    "query" : { 
        "match_all" : {} 
    },
    "fields": ["document_field_to_be_returned"]
}
'

Won't this return only 10 results? (8 upvotes)
This no longer works in 5.x: the `fields` field was removed; add the `"_source": false` param instead. (5 upvotes)
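
On the two comments above: a search returns only 10 hits unless `size` is set, and on 5.x and later `"_source": false` takes over from the empty `fields` array. Here is a minimal sketch of both points, assuming the response shape shown in the answer. The helper names `build_ids_query` and `extract_ids` are made up for illustration, and the size of 20,000 is just the document count mentioned in the question.

```python
def build_ids_query(size):
    """Request body that asks for hit metadata only (no _source)."""
    return {
        "query": {"match_all": {}},
        "_source": False,   # 5.x+ replacement for the old "fields": []
        "size": size,       # a search returns 10 hits when this is omitted
    }


def extract_ids(response):
    """Pull the _id of every hit out of a search response dict."""
    return [hit["_id"] for hit in response["hits"]["hits"]]


# Response trimmed from the example output in the answer.
sample_response = {
    "hits": {
        "total": 4,
        "hits": [
            {"_index": "index", "_type": "type", "_id": "36", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "38", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "39", "_score": 1.0},
            {"_index": "index", "_type": "type", "_id": "34", "_score": 1.0},
        ],
    }
}

body = build_ids_query(size=20000)
ids = extract_ids(sample_response)
print(ids)
```

Note that `index.max_result_window` defaults to 10,000, so a single request for 20,000 hits would need that setting raised; past that point scroll is the usual route.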

Answer 2

Ale*_*raf 46

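It is better to use scroll and scan to get the result list, so that elasticsearch doesn't have to rank and sort the results.

With the elasticsearch-dsl Python lib this can be accomplished by: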

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)

s = s.fields([])  # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]

Console log:

GET http://localhost:9200/my_index/my_doc/_search?search_type=scan&scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.003s]
GET http://localhost:9200/_search/scroll?scroll=5m [status:200 request:0.005s]
...

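Note: scroll pulls batches of results from a query and keeps the cursor open for a given amount of time (1 minute, 2 minutes, which you can update); scan disables sorting. The scan helper function returns a Python generator which can be safely iterated through.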
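
To make the scroll mechanics concrete, here is a rough sketch of the loop that the scan helper hides. It assumes a client exposing `search`, `scroll` and `clear_scroll` with the keyword arguments used by the elasticsearch-py 5.x to 7.x clients; newer clients may want the body fields passed differently. `FakeClient` is a hypothetical in-memory stand-in so the sketch runs without a cluster, and `scroll_all_ids` is an illustrative name, not a library function. Sorting by `_doc` is used here in place of the old `search_type=scan`.

```python
def scroll_all_ids(client, index, page_size=1000, keep_alive="1m"):
    """Yield every _id in `index` by paging with the scroll API."""
    body = {"query": {"match_all": {}}, "sort": ["_doc"], "_source": False}
    page = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:          # an empty page means we are done
            for hit in page["hits"]["hits"]:
                yield hit["_id"]
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        client.clear_scroll(scroll_id=scroll_id)  # release the server-side cursor


class FakeClient:
    """Hypothetical stand-in that serves ids from a list, page by page."""

    def __init__(self, ids):
        self._ids = list(ids)
        self._pos = 0
        self._size = 0
        self.cleared = False

    def _page(self):
        chunk = self._ids[self._pos:self._pos + self._size]
        self._pos += len(chunk)
        return {"_scroll_id": "fake", "hits": {"hits": [{"_id": i} for i in chunk]}}

    def search(self, index, body, scroll, size):
        self._pos, self._size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


client = FakeClient(str(n) for n in range(25))
all_ids = list(scroll_all_ids(client, index="my_index", page_size=10))
print(len(all_ids))
```

In real use you would pass an `Elasticsearch()` instance instead of `FakeClient`; in most cases `helpers.scan` is still the simpler choice.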

The method `fields` was removed in version `5.0.0` (see: https://elasticsearch-dsl.readthedocs.io/en/latest/Changelog.html?highlight=fields#id2). You should now use `s = s.source([])`. (14 upvotes)
search_type=scan has been deprecated since 2.1. (https://www.elastic.co/guide/en/elasticsearch/reference/2.1/breaking_21_search_changes.html) (4 upvotes)

Answer 3

Nav*_*Nav 15

For elasticsearch 5.x, you can use the "_source" field.

GET /_search
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

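"fields" has been deprecated. (Error: "The field [fields] is no longer supported, please use [stored_fields] to retrieve stored fields or _source filtering if the field is not stored")

Bonus points for adding the error text. Elasticsearch error messages mostly don't seem to be very Googlable :( (2 upvotes)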

Answer 4

Bri*_*Low 13

Another option:

curl 'http://localhost:9200/index/type/_search?pretty=true&fields='

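This will return _index, _type, _id and _score.

-1. It's better to use scan and scroll when accessing multiple documents. This is a "quick way" to do it, but it won't perform well and might also fail on large indices. (2 upvotes)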

Answer 5

san*_*ler 8

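Elaborating on the 2 answers by @Robert-Lujo and @Aleck-Landgraf (someone with the permissions can gladly move this to a comment): if you don't want to print but rather get everything from the returned generator into a list, here is what I use: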

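from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts=[YOUR_ES_HOST])
# helpers.scan returns a generator over all hits (like the other answers so far)
a = helpers.scan(es, query={"query": {"match_all": {}}}, scroll='1m', index=INDEX_NAME)

# materialize the generator into a list of ids
IDs = [aa['_id'] for aa in a]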

Answer 6

Ale*_*emi 5

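I know this post has a lot of answers, but I want to combine several of them to document what I've found to be fastest (in Python anyway). I'm dealing with hundreds of millions of documents, rather than thousands.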

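The helpers module can be used with sliced scroll and thus allows multi-threaded execution. In my case, I also have a high-cardinality field to provide (acquired_at). You'll see I set max_workers to 14, but you may want to vary this depending on your machine.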
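
As a rough sketch of what sliced scroll expects: every worker sends the same query with its own `slice.id`, and `slice.max` is the total number of slices, so valid ids run from 0 to max - 1. The helper name `build_slice_queries` is made up for illustration; the `acquired_at` field and the count of 14 are taken from this answer.

```python
def build_slice_queries(num_slices, field=None):
    """One scroll request body per slice; ids run from 0 to num_slices - 1."""
    queries = []
    for slice_id in range(num_slices):
        slice_spec = {"id": slice_id, "max": num_slices}
        if field is not None:
            slice_spec["field"] = field  # optional field used to split documents
        queries.append({"sort": ["_doc"], "slice": slice_spec, "_source": False})
    return queries


queries = build_slice_queries(14, field="acquired_at")
print(queries[0]["slice"])
print(queries[-1]["slice"])
```

Each body can then be handed to its own `helpers.scan` call in a separate thread. If any slice id in the range is never requested, the documents in that slice are silently left out of the dump.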

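Additionally, I'm storing the doc ids in compressed format. If you're curious, you can check how many bytes your doc ids will be and estimate the final dump size.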
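
One way to do that estimate, as a back-of-the-envelope sketch: measure the bytes per id (plus newline) on a sample, multiply by the document count, and scale by the gzip ratio seen on the sample. The function name `estimate_dump_size`, the sample ids and the figure of 300 million documents are all assumptions for illustration.

```python
import gzip


def estimate_dump_size(sample_ids, total_docs):
    """Return (uncompressed_bytes, gzip_bytes) extrapolated from a sample of ids."""
    raw = "".join(doc_id + "\n" for doc_id in sample_ids).encode("utf-8")
    bytes_per_id = len(raw) / len(sample_ids)      # includes the newline
    ratio = len(gzip.compress(raw)) / len(raw)     # compression seen on the sample
    uncompressed = bytes_per_id * total_docs
    return int(uncompressed), int(uncompressed * ratio)


# Hypothetical sample: 1,000 ids of 20 characters each.
sample = ["%020d" % n for n in range(1000)]
raw_bytes, gz_bytes = estimate_dump_size(sample, total_docs=300_000_000)
print(raw_bytes, gz_bytes)
```

Sequential ids like the ones in this sample compress far better than random-looking ids would, so it is worth feeding in a sample of your real ids before trusting the compressed figure.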

# note below I have es, index, and cluster_name variables already set
import gzip
from concurrent import futures

from elasticsearch import helpers

max_workers = 14
scroll_slice_ids = list(range(0, max_workers))

def get_doc_ids(scroll_slice_id):
    count = 0
    with gzip.open('/tmp/doc_ids_%i.txt.gz' % scroll_slice_id, 'wt') as results_file:
        # "max" is the total number of slices; slice ids run from 0 to max - 1
        query = {"sort": ["_doc"], "slice": {"field": "acquired_at", "id": scroll_slice_id, "max": len(scroll_slice_ids)}, "_source": False}
        scan = helpers.scan(es, index=index, query=query, scroll='10m', size=10000, request_timeout=600)
        for doc in scan:
            count += 1
            results_file.write((doc['_id'] + '\n'))
            results_file.flush()

    return count

if __name__ == '__main__':
    print('attempting to dump doc ids from %s in %i slices' % (cluster_name, len(scroll_slice_ids)))
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        doc_counts = executor.map(get_doc_ids, scroll_slice_ids)
        # consuming the results also surfaces any exception raised in a worker
        print('dumped %i doc ids' % sum(doc_counts))

If you want to follow up on how many ids are in the files, you can use unpigz -c /tmp/doc_ids_4.txt.gz | wc -l.
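
If pigz isn't installed, the same count can be done from Python. This is a small sketch using only the standard library; the function names and the throwaway demo files are made up for illustration, and the glob pattern mirrors the `/tmp/doc_ids_%i.txt.gz` naming used in the dump script.

```python
import glob
import gzip
import os
import tempfile


def count_ids(path):
    """Count newline-terminated ids in one gzip-compressed file."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle)


def count_all_ids(pattern):
    """Sum the id counts over every file matching a glob pattern."""
    return sum(count_ids(path) for path in sorted(glob.glob(pattern)))


# Demo on throwaway files so the sketch runs without a real dump.
demo_dir = tempfile.mkdtemp()
for slice_id, n_ids in enumerate([3, 5]):
    path = os.path.join(demo_dir, "doc_ids_%i.txt.gz" % slice_id)
    with gzip.open(path, "wt") as out:
        for n in range(n_ids):
            out.write("id-%i-%i\n" % (slice_id, n))

total = count_all_ids(os.path.join(demo_dir, "doc_ids_*.txt.gz"))
print(total)
```

For the real dump you would call `count_all_ids('/tmp/doc_ids_*.txt.gz')` and compare the total against the index's document count.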

Archived:	12 years, 7 months ago
Views:	59225
Last activity:	6 years, 3 months ago