将维基百科转储加载到 Elasticsearch

Sch*_*r22 3 xml wikipedia elasticsearch

我想一个XML维基百科转储像加载: http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiki/20171001/enwiki-20171001-pages-articles.xml.bz2 到Elasticsearch(5.6。 4)。但是,我发现的所有工具和教程都已过时,并且与我的 Elasticsearch 版本不兼容。谁能解释将转储导入 Elasticsearch 的最佳方法是什么?

Lup*_*ide 6

两年前,维基媒体提供了生产弹性搜索索引的转储。

索引每周导出一次,每个 wiki 都有两次导出。

The content index, which contains only article pages, called content;
The general index, containing all pages. This includes talk pages, templates, etc, called general;
Run Code Online (Sandbox Code Playgroud)

你可以在这里找到它们http://dumps.wikimedia.org/other/cirrussearch/current/

  • 根据您的需要创建映射。例如:

    {
         "mappings": {
         "page": {
            "properties": {
               "auxiliary_text": {
                  "type": "text"
               },
               "category": {
                  "type": "text"
               },
               "coordinates": {
                  "properties": {
                     "coord": {
                        "properties": {
                           "lat": {
                              "type": "double"
                           },
                           "lon": {
                              "type": "double"
                           }
                        }
                     },
                     "country": {
                        "type": "text"
                     },
                     "dim": {
                        "type": "long"
                     },
                     "globe": {
                        "type": "text"
                     },
                     "name": {
                        "type": "text"
                     },
                     "primary": {
                        "type": "boolean"
                     },
                     "region": {
                        "type": "text"
                     },
                     "type": {
                        "type": "text"
                     }
                  }
               },
               "defaultsort": {
                  "type": "boolean"
               },
               "external_link": {
                  "type": "text"
               },
               "heading": {
                  "type": "text"
               },
               "incoming_links": {
                  "type": "long"
               },
               "language": {
                  "type": "text"
               },
               "namespace": {
                  "type": "long"
               },
               "namespace_text": {
                  "type": "text"
               },
               "opening_text": {
                  "type": "text"
               },
               "outgoing_link": {
                  "type": "text"
               },
               "popularity_score": {
                  "type": "double"
               },
               "redirect": {
                  "properties": {
                     "namespace": {
                        "type": "long"
                     },
                     "title": {
                        "type": "text"
                     }
                  }
               },
               "score": {
                  "type": "double"
               },
               "source_text": {
                  "type": "text"
               },
               "template": {
                  "type": "text"
               },
               "text": {
                  "type": "text"
               },
               "text_bytes": {
                  "type": "long"
               },
               "timestamp": {
                  "type": "date",
                  "format": "strict_date_optional_time||epoch_millis"
               },
               "title": {
                  "type": "text"
               },
               "version": {
                  "type": "long"
               },
               "version_type": {
                  "type": "text"
               },
               "wiki": {
                  "type": "text"
               },
               "wikibase_item": {
                  "type": "text"
               }
            }
         }
      }
    
    Run Code Online (Sandbox Code Playgroud)

    }

创建索引后,只需键入:

zcat enwiki-current-cirrussearch-general.json.gz | parallel --pipe -L 2 -N 2000 -j3 'curl -s http://localhost:9200/enwiki/_bulk --data-binary @- > /dev/null'
Run Code Online (Sandbox Code Playgroud)

享受!