How do I extract all sections of a Wikipedia page as plain text?

Question

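I have the following Python code, which extracts only the introduction of the article on "Artificial intelligence". I want to extract all of the subsections as well (History, Goals, ...).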

import requests

def get_wikipedia_page(page_title):
  endpoint = "https://en.wikipedia.org/w/api.php"
  params = {
    "format": "json",
    "action": "query",
    "prop": "extracts",
    "exintro": "",
    "explaintext": "",
    "titles": page_title
  }
  response = requests.get(endpoint, params=params)
  data = response.json()
  pages = data["query"]["pages"]
  page_id = list(pages.keys())[0]
  return pages[page_id]["extract"]

page_title = "Artificial intelligence"
wikipedia_page = get_wikipedia_page(page_title)

Someone suggested another approach: parse the HTML and convert it to text with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)

print(text)

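This is not a good enough solution, because it includes all the text that appears on the page (such as image captions) and keeps the citation markers in the text (e.g. [1]), whereas the first script strips them.

I suspect Wikipedia's API offers a more elegant solution. It would be strange if only the first section could be retrieved.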

Answer 1

hc_*_*dev 5

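Retrieving a Wikipedia page as HTML

Just as a web browser does, we can retrieve the full Wikipedia page by URL and parse the HTML response with Beautiful Soup.

Wikipedia's API

Alternatively, we can use the API; see Wikipedia's API documentation.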

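Extracting plain text

When using action=query with format=json, the following four options control text extraction:

titles=Artificial intelligence: the page to query
prop=extracts: use the TextExtracts extension
exintro: limit the response to the content before the first section heading (remove this option to get the whole text, including the sections)
explaintext: return the extract as plain text instead of HTML

Example: https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Artificial%20intelligence&prop=extracts&explaintext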
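
The request above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: the helper names (build_extract_params, extract_from_response, split_sections, fetch_plain_text) and the User-Agent string are assumptions made for illustration, and it uses the standard library's urllib instead of requests. It sets exsectionformat=wiki explicitly so that headings come back as "== History ==" lines, which lets the plain text be split into sections.

```python
import json
import re
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"
# Placeholder User-Agent (an assumption); replace it with your own details.
USER_AGENT = "wiki-extract-example/0.1 (you@example.com)"


def build_extract_params(title, intro_only=False):
    """Query parameters for a plain-text extract of one page."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "explaintext": "1",
        "exsectionformat": "wiki",  # headings appear as "== History =="
        "titles": title,
    }
    if intro_only:
        params["exintro"] = "1"  # only the text before the first heading
    return params


def extract_from_response(data):
    """Pull the extract string out of a decoded action=query response."""
    pages = data["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract", "")


def split_sections(text):
    """Split a wiki-formatted extract into {heading: body}.

    The lead text is stored under the empty heading "". Subsections are
    flattened, and a repeated heading overwrites the earlier one.
    """
    sections = {}
    heading, body = "", []
    for line in text.splitlines():
        match = re.fullmatch(r"\s*(={2,6})\s*(.+?)\s*\1\s*", line)
        if match:
            sections[heading] = "\n".join(body).strip()
            heading, body = match.group(2), []
        else:
            body.append(line)
    sections[heading] = "\n".join(body).strip()
    return sections


def fetch_plain_text(title):
    """Fetch the full plain-text extract of a page (needs network access)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_extract_params(title))
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return extract_from_response(json.load(response))
```

Calling split_sections(fetch_plain_text("Artificial intelligence")) should then give one entry per heading, with the introduction under the empty key.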

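Getting each section separately

To retrieve the sections, use action=parse with format=json and these options:

page=Artificial intelligence: fetch the content of that page
prop=sections: return only the sections

There is also an API sandbox where you can try out different parameters. The following GET request retrieves all sections of the example page "Artificial intelligence": https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Artificial%20intelligence&prop=sections&formatversion=2

This responds with JSON containing all the sections: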

{
    "parse": {
        "title": "Artificial intelligence",
        "pageid": 1164,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "History",
                "number": "1",
                "index": "1",
                "fromtitle": "Artificial_intelligence",
                "byteoffset": 5987,
                "anchor": "History",
                "linkAnchor": "History"
            }
        ]
    }
}

(Simplified; only the first section is shown.)
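
As a rough illustration of working with that response, the sketch below builds the action=parse URL and turns the "sections" array into a small outline. The function names (build_sections_url, list_sections, format_outline) are hypothetical; the field names "index", "level" and "line" are the ones shown in the JSON above.

```python
import urllib.parse

API_URL = "https://en.wikipedia.org/w/api.php"


def build_sections_url(page):
    """URL that asks action=parse for the section list of a page."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": "sections",
        "formatversion": "2",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def list_sections(data):
    """Return (index, level, heading) for each section in the response.

    "index" is kept as a string, exactly as it appears in the response;
    "level" is converted to an int (2 is a top-level heading).
    """
    return [
        (section["index"], int(section["level"]), section["line"])
        for section in data["parse"]["sections"]
    ]


def format_outline(sections):
    """Render the section list as an indented text outline."""
    return "\n".join(
        "  " * (level - 2) + f"{index}. {heading}"
        for index, level, heading in sections
    )
```

The "index" value of each entry is what the section parameter expects when a single section is requested.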

To get the text of one of these sections, specify the section as a query parameter (by id or title), for example section=1&sectiontitle=History: https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=Artificial_intelligence&section=1&sectiontitle=History&formatversion=2

This retrieves the text (as HTML):

{
    "parse": {
        "title": "Artificial intelligence",
        "pageid": 1164,
        "revid": 1126677096,
        "text": "<div class=\"mw-parser-output\"><h2><span class=\"mw-headline\" id=\"History\">History</span><span class=\"mw-editsection\"><span class=\"mw-editsection-bracket\">[</span><a href=\"/w/index.php?title=Artificial_intelligence&amp;action=edit&amp;section=1\" title=\"Edit section: History\">edit</a><span class=\"mw-editsection-bracket\">]</span></span></h2>\n<style data-mw-deduplicate=\"TemplateStyles:r1033289096\">.mw-parser-output .hatnote{font-style:italic}.mw-parser-output div.hatnote{padding-left:1.6em;margin-bottom:0.5em}.mw-parser-output .hatnote i{font-style:normal}.mw-parser-output .hatnote+link+.hatnote{margin-top:-0.5em}</style><div role=\"note\" class=\"hatnote navigation-not-searchable\">Main articles: <a href=\"/wiki/History_of_artificial_intelligence\" title=\"History of artificial intelligence\">History of artificial intelligence</a> and <a href=\"/wiki/Timeline_of_artificial_intelligence\" title=\"Timeline of artificial intelligence\">Timeline of artificial intelligence</a>

Note: the response above has been truncated and shows only a sample of the text.

Although the text above is formatted as HTML, there may be options to return it as plain text.
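
If no such option is available, the HTML can be reduced to plain text on the client. The following is a minimal sketch using only the standard library's html.parser. The names SectionTextExtractor, section_html and html_to_text are hypothetical, and the skipped class names are assumptions: "mw-editsection" appears in the response above, while "reference" is assumed to be the class of the citation markers such as [1].

```python
from html.parser import HTMLParser

SKIP_TAGS = {"style", "script"}
SKIP_CLASSES = {"mw-editsection", "reference"}  # assumed class names
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class SectionTextExtractor(HTMLParser):
    """Collect text, ignoring styles, edit links and citation markers."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they are not counted
        classes = set((dict(attrs).get("class") or "").split())
        if self.skip_depth or tag in SKIP_TAGS or classes & SKIP_CLASSES:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def section_html(data):
    """Return the HTML string from an action=parse response.

    With formatversion=2 the "text" field is a string; with the older
    format it is an object whose "*" key holds the HTML.
    """
    text = data["parse"]["text"]
    return text["*"] if isinstance(text, dict) else text


def html_to_text(html):
    """Strip tags and collapse whitespace, one non-empty line per row."""
    parser = SectionTextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```

The depth counter assumes well-formed, balanced markup; for messier HTML a full parser such as Beautiful Soup is likely the safer choice.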

Python code

You can also do this from Python, for example with:

The wikipedia package

A gist by Sai Kumar Yava (scionoftech) that uses requests: a small piece of Python code to get Wikipedia page content as plain text

Archived:	3 years, 1 month ago
Views:	1904
Last activity:	2 years ago