avi*_*iss 5 wikipedia beautifulsoup web-scraping python-3.x
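I'm having trouble scraping a wiki table and hope someone who has done this before can give me some advice. From List_of_current_heads_of_state_and_government I need the countries (which works with the code below), and then only the first-mentioned head of state plus their name. I'm not sure how to isolate the first mention, since they all appear in a single cell. My attempt to extract the names gave me this error: IndexError: list index out of range. I'd appreciate your help!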
import requests
from bs4 import BeautifulSoup

wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
website_url = requests.get(wiki).text
soup = BeautifulSoup(website_url,'lxml')

my_table = soup.find('table',{'class':'wikitable plainrowheaders'})
#print(my_table)

states = []
titles = []
names = []
for row in my_table.find_all('tr')[1:]:
    state_cell = row.find_all('a')[0]
    states.append(state_cell.text)
print(states)
for row in my_table.find_all('td'):
    title_cell = row.find_all('a')[0]
    titles.append(title_cell.text)
print(titles)
for row in my_table.find_all('td'):
    name_cell = row.find_all('a')[1]
    names.append(name_cell.text)
print(names)
The ideal output would be a pandas df:
State | Title | Name |
rup*_*rup 13
I found a super simple and short way to do this: import the wikipedia Python module, then use pandas' read_html to load the table into a DataFrame.
From there you can apply any analysis you want.
import pandas as pd
import wikipedia as wp

html = wp.page("List_of_video_games_considered_the_best").html().encode("UTF-8")
try:
    df = pd.read_html(html)[1]  # Try 2nd table first as most pages contain contents table first
except IndexError:
    df = pd.read_html(html)[0]
print(df.to_string())
Or, if you want to call it from the command line:
Just call python yourfile.py -p Wikipedia_Page_Article_Here
import pandas as pd
import argparse
import wikipedia as wp

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--wiki_page", help="Give a wiki page to get table", required=True)
args = parser.parse_args()

html = wp.page(args.wiki_page).html().encode("UTF-8")
try:
    df = pd.read_html(html)[1]  # Try 2nd table first as most pages contain contents table first
except IndexError:
    df = pd.read_html(html)[0]
print(df.to_string())
Hope this helps someone out there!
If I understand your question correctly, the following should help you:
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"

res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
for items in soup.find('table', class_='wikitable').find_all('tr')[1::1]:
    data = items.find_all(['th', 'td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
        name = data[1].a.find_next_sibling().text
    except IndexError:
        continue  # skip rows that lack the expected cells instead of printing stale values
    print("{}|{}|{}".format(country, title, name))

Output:
Afghanistan|President|Ashraf Ghani
Albania|President|Ilir Meta
Algeria|President|Abdelaziz Bouteflika
Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
Angola|President|João Lourenço
Antigua and Barbuda|Queen|Elizabeth II
Argentina|President|Mauricio Macri

and so on.
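The IndexError in the question most likely comes from looping over every td and asking for a second link with find_all('a')[1], which fails on any cell that holds fewer than two links. Below is a minimal sketch of one way to guard against that and collect State | Title | Name rows. It runs on a small inline sample instead of the live page, so the table layout (a state link in the first cell, then title and name as the first two links of the second cell) is an assumption, not something checked against Wikipedia. The helper name first_head_of_state, SAMPLE_HTML, and the "Exampleland" row are made up for illustration, and continuation rows created by rowspan are not handled.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table; the real markup may differ.
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th>State</th><th>Head of state</th><th>Head of government</th></tr>
  <tr>
    <th><a href="#">Afghanistan</a></th>
    <td><a href="#">President</a> - <a href="#">Ashraf Ghani</a></td>
    <td><a href="#">Chief Executive</a> - <a href="#">Abdullah Abdullah</a></td>
  </tr>
  <tr>
    <th><a href="#">Andorra</a></th>
    <td><a href="#">Episcopal Co-Prince</a> - <a href="#">Joan Enric Vives Sicília</a><br>
        <a href="#">French Co-Prince</a> - <a href="#">Emmanuel Macron</a></td>
    <td><a href="#">Head of Government</a> - <a href="#">Antoni Martí</a></td>
  </tr>
  <tr>
    <th><a href="#">Exampleland</a></th>
    <td><a href="#">President</a> - vacant</td>
    <td></td>
  </tr>
</table>
"""


def first_head_of_state(html):
    """Return one dict per usable row with the keys State, Title and Name."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = []
    for tr in table.find_all("tr")[1:]:  # skip the header row
        cells = tr.find_all(["th", "td"])
        if len(cells) < 2:
            continue  # no head-of-state cell in this row
        state_link = cells[0].find("a")
        links = cells[1].find_all("a")
        if state_link is None or len(links) < 2:
            continue  # fewer than two links: indexing [1] here would raise IndexError
        rows.append({
            "State": state_link.get_text(strip=True),
            "Title": links[0].get_text(strip=True),  # first link = first title mentioned
            "Name": links[1].get_text(strip=True),   # second link = that person's name
        })
    return rows


if __name__ == "__main__":
    for row in first_head_of_state(SAMPLE_HTML):
        print("{State}|{Title}|{Name}".format(**row))
```

Passing the returned list to pd.DataFrame(rows, columns=["State", "Title", "Name"]) should give the three-column frame the question asks for. For the live page, the sample string would be replaced by requests.get(URL).text.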