18 html python wikipedia beautifulsoup web-scraping
I have this script written in Python 3:
url = "https://en.wikipedia.org/wiki/Mathematics"
# simple_get is the asker's own helper (not shown here)
response = simple_get(url)

result = {}
result["url"] = url
if response is not None:
    html = BeautifulSoup(response, 'html.parser')
    title = html.select("#firstHeading")[0].text
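As you can see, I can get the title from the article, but I can't figure out how to get the text from "Mathematics (from Greek μά..." up to the table of contents...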
ale*_*cxe 32
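There is a much easier way to get information from Wikipedia: the Wikipedia API.
There is this Python wrapper, which lets you do it in a few lines with zero HTML parsing: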
import wikipediaapi
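# recent Wikipedia-API releases require a user_agent; the value below is a placeholder
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyProjectName (me@example.com)', language='en')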
page = wiki_wiki.page('Mathematics')
print(page.summary)
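Prints:
Mathematics (from Greek μάθημα máthēma, "knowledge, study, learning") includes the study of such topics as quantity, structure, space, and change...(omitted intentionally)
And, in general, try to avoid screen scraping if a direct API is available.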
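For readers who would rather not install a wrapper, the same lead text can be requested from the MediaWiki Action API directly. The following is only a sketch using the standard library: it assumes the `prop=extracts` query with the `exintro` and `explaintext` flags, and the function names (`build_query`, `extract_intro`, `fetch_intro`) and the User-Agent string are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query(title):
    """Return query parameters asking for a plain-text intro extract."""
    return {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,      # only the text before the first section heading
        "explaintext": 1,  # plain text instead of HTML
        "redirects": 1,    # follow redirects to the target article
        "titles": title,
        "format": "json",
    }


def extract_intro(payload):
    """Pull the extract out of a decoded API response, or return None."""
    pages = payload.get("query", {}).get("pages", {})
    for page in pages.values():
        if "extract" in page:
            return page["extract"]
    return None


def fetch_intro(title, user_agent="intro-example/0.1 (placeholder contact)"):
    """Fetch the intro of an article over HTTP (hypothetical helper)."""
    url = API_URL + "?" + urllib.parse.urlencode(build_query(title))
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_intro(json.load(response))
```

Calling `fetch_intro("Mathematics")` should return the text that precedes the table of contents; a missing page yields `None` because the response then carries no `extract` field.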
chi*_*n88 16
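Select the <p> tags. There are 52 elements. Not sure if you want the whole thing, but you can iterate through those tags to store the text as you like. I just chose to print each one to show the output.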
import bs4
import requests

response = requests.get("https://en.wikipedia.org/wiki/Mathematics")

# requests.get() never returns None, so check the HTTP status instead
if response.ok:
    html = bs4.BeautifulSoup(response.text, 'html.parser')
    title = html.select("#firstHeading")[0].text
    paragraphs = html.select("p")

    # print every paragraph to show the output
    for para in paragraphs:
        print(para.text)

    # just grab the text up to contents as stated in question
    intro = '\n'.join([para.text for para in paragraphs[0:5]])
    print(intro)
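The fixed slice `paragraphs[0:5]` depends on how many paragraphs the lead happens to have. One possible alternative is to collect paragraphs until the first section heading. This is a rough sketch, assuming the article body sits inside `#mw-content-text` and that the lead ends at the first `<h2>`; Wikipedia's markup changes over time, so both assumptions may need adjusting. The function name `intro_paragraphs` is hypothetical.

```python
from bs4 import BeautifulSoup


def intro_paragraphs(html, root_selector="#mw-content-text"):
    """Return the text of each non-empty <p> before the first <h2>."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumed container for the article body; fall back to the whole page.
    root = soup.select_one(root_selector)
    if root is None:
        root = soup

    intro = []
    # find_all with a list of names yields matching tags in document order.
    for tag in root.find_all(["p", "h2"]):
        if tag.name == "h2":
            break  # first section heading reached, the lead is over
        text = tag.get_text().strip()
        if text:  # skip empty placeholder paragraphs
            intro.append(text)
    return intro
```

With the page fetched as above, `'\n'.join(intro_paragraphs(response.text))` would give the lead as a single string.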
QHa*_*arr 15
Use the wikipedia library:
import wikipedia
#print(wikipedia.summary("Mathematics"))
#wikipedia.search("Mathematics")
print(wikipedia.page("Mathematics").content)
You can get the desired output using the lxml library, as follows.
import requests
from lxml.html import fromstring
url = "https://en.wikipedia.org/wiki/Mathematics"
res = requests.get(url)
source = fromstring(res.content)
paragraph = '\n'.join([item.text_content() for item in source.xpath('//p[following::h2[2][span="History"]]')])
print(paragraph)
Using BeautifulSoup:
from bs4 import BeautifulSoup
import requests
res = requests.get("https://en.wikipedia.org/wiki/Mathematics")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.find_all("p"):
    # stop at the first paragraph of the History section
    if item.text.startswith("The history"):
        break
    print(item.text)