Sch*_*r22 3 xml wikipedia elasticsearch
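I want to load an XML Wikipedia dump such as http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20171001/enwiki-20171001-pages-articles.xml.bz2 into Elasticsearch (5.6.4). However, all the tools and tutorials I have found are outdated and incompatible with my Elasticsearch version. Can anyone explain the best way to import the dump into Elasticsearch?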
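Two years ago, Wikimedia made dumps of its production Elasticsearch indices available.
The indices are exported weekly, and there are two exports for each wiki: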
The content index, which contains only article pages, called content;
The general index, containing all pages. This includes talk pages, templates, etc, called general;
You can find them here: http://dumps.wikimedia.org/other/cirrussearch/current/
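Before writing a mapping, it can help to look at what a dump actually contains. The sketch below assumes the dump is a gzip file in Elasticsearch bulk format, that is, alternating action lines and document lines. The function names `iter_records` and `peek_fields` are made up for illustration and are not part of any Wikimedia or Elasticsearch tooling.

```python
import gzip
import json
from itertools import islice


def iter_records(lines):
    """Yield (action, document) pairs from bulk-format lines.

    Assumes each record is two lines: an action line followed by
    the document source. Blank lines are skipped.
    """
    it = (line for line in lines if line.strip())
    while True:
        action_line = next(it, None)
        if action_line is None:
            return
        doc_line = next(it, None)
        if doc_line is None:
            return  # trailing action line without a document; ignore it
        yield json.loads(action_line), json.loads(doc_line)


def peek_fields(path, n=5):
    """Return the sorted field names seen in the first n documents of a dump."""
    fields = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for _action, doc in islice(iter_records(fh), n):
            fields.update(doc)
    return sorted(fields)
```

Calling `peek_fields("enwiki-current-cirrussearch-general.json.gz")` on a downloaded dump lists the top-level fields, which is a reasonable starting point for deciding which ones your mapping needs.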
Create a mapping according to your needs. For example:
{
  "mappings": {
    "page": {
      "properties": {
        "auxiliary_text": {
          "type": "text"
        },
        "category": {
          "type": "text"
        },
        "coordinates": {
          "properties": {
            "coord": {
              "properties": {
                "lat": {
                  "type": "double"
                },
                "lon": {
                  "type": "double"
                }
              }
            },
            "country": {
              "type": "text"
            },
            "dim": {
              "type": "long"
            },
            "globe": {
              "type": "text"
            },
            "name": {
              "type": "text"
            },
            "primary": {
              "type": "boolean"
            },
            "region": {
              "type": "text"
            },
            "type": {
              "type": "text"
            }
          }
        },
        "defaultsort": {
          "type": "boolean"
        },
        "external_link": {
          "type": "text"
        },
        "heading": {
          "type": "text"
        },
        "incoming_links": {
          "type": "long"
        },
        "language": {
          "type": "text"
        },
        "namespace": {
          "type": "long"
        },
        "namespace_text": {
          "type": "text"
        },
        "opening_text": {
          "type": "text"
        },
        "outgoing_link": {
          "type": "text"
        },
        "popularity_score": {
          "type": "double"
        },
        "redirect": {
          "properties": {
            "namespace": {
              "type": "long"
            },
            "title": {
              "type": "text"
            }
          }
        },
        "score": {
          "type": "double"
        },
        "source_text": {
          "type": "text"
        },
        "template": {
          "type": "text"
        },
        "text": {
          "type": "text"
        },
        "text_bytes": {
          "type": "long"
        },
        "timestamp": {
          "type": "date",
          "format": "strict_date_optional_time||epoch_millis"
        },
        "title": {
          "type": "text"
        },
        "version": {
          "type": "long"
        },
        "version_type": {
          "type": "text"
        },
        "wiki": {
          "type": "text"
        },
        "wikibase_item": {
          "type": "text"
        }
      }
    }
  }
}
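To create the index with this mapping, you send the JSON as the body of a PUT request to the index URL. Here is a minimal sketch using only the Python standard library. It assumes the mapping is saved in a file called `mapping.json` (a hypothetical name) and that Elasticsearch is listening on `localhost:9200` with the index named `enwiki`, as in the bulk command in this answer. `build_create_index_request` and `create_index` are illustrative helpers, not an existing API.

```python
import json
import urllib.request


def build_create_index_request(mapping, base_url="http://localhost:9200", index="enwiki"):
    """Build (but do not send) the PUT request that creates the index."""
    body = json.dumps(mapping).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def create_index(mapping_path="mapping.json"):
    """Read the mapping from disk and create the index; returns the parsed response."""
    # mapping_path is an assumed file name; adjust to wherever you saved the JSON.
    with open(mapping_path, encoding="utf-8") as fh:
        mapping = json.load(fh)
    with urllib.request.urlopen(build_create_index_request(mapping)) as resp:
        return json.load(resp)
```

If the index already exists, `urlopen` raises an `HTTPError`, so delete the old index first or pick another name.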
Once the index is created, just run:
zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
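If GNU parallel is not available, roughly the same thing can be done from Python. As far as I understand the flags, the command above treats every 2 lines as one record, sends 2000 records per request, and runs 3 requests at a time. The sketch below keeps the batching but sends requests one after another, so it will be slower. Unlike the curl version, which discards the response, it also counts the requests in which Elasticsearch reported item errors. `iter_bulk_bodies` and `load_dump` are hypothetical names.

```python
import gzip
import json
import urllib.request
from itertools import islice


def iter_bulk_bodies(lines, records_per_request=2000):
    """Group bulk-format lines into request bodies of N two-line records."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 2 * records_per_request))
        if not chunk:
            return
        # The bulk API expects every line, including the last, to end with a newline.
        body = "".join(line if line.endswith("\n") else line + "\n" for line in chunk)
        yield body.encode("utf-8")


def load_dump(path, url="http://localhost:9200/enwiki/_bulk"):
    """POST the dump to the bulk endpoint; returns how many requests had item errors."""
    failed = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for body in iter_bulk_bodies(fh):
            req = urllib.request.Request(
                url,
                data=body,
                method="POST",
                headers={"Content-Type": "application/x-ndjson"},
            )
            with urllib.request.urlopen(req) as resp:
                if json.load(resp).get("errors"):
                    failed += 1
    return failed
```

Usage would be `load_dump("enwiki-current-cirrussearch-general.json.gz")`. A non-zero return value means some documents were rejected, most likely because they did not fit the mapping.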
Enjoy!