Scraping a table from the web with Python

Question


0 python parsing beautifulsoup html-parsing web-scraping

I am trying to get the whole table (all 1000+ universities) from this site: https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores.

To do this I am using the requests and BeautifulSoup libraries. My code is:

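import requests
from bs4 import BeautifulSoup

# Download the page; the part after '#' is a URL fragment and is not sent to the server
html_content = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
# Pass the response text (not the Response object) to the parser
soup = BeautifulSoup(html_content.text, 'lxml')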


Then I look for a table:

table = soup.find_all('table')[0]


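But in the result I cannot see the table body itself: there is no <tbody>, no rows (<tr>) and no cells (<td>).

HTML code:

Could you please help me get all the information from this site and build a DataFrame from it?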

Answer 1

SIM*_*SIM 5

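Try the approach below. If you watch the network activity in the XHR section of the Network tab in devtools, you can find the URL the page loads its data from. In any case, this is what your script looks like when it takes the data from the JSON response.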

import requests

# JSON file requested by the page, found in the XHR section of the Network tab
URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

res = requests.get(URL)
# Each entry of the 'data' list describes one university
for items in res.json()['data']:
    rank = items['rank']
    name = items['name']
    intstudents = items['stats_pc_intl_students']
    ratio = items['stats_female_male_ratio']
    print(rank, name, intstudents, ratio)


Output:

1 University of Oxford 38% 46 : 54
2 University of Cambridge 35% 45 : 55
=3 California Institute of Technology 27% 31 : 69
=3 Stanford University 22% 42 : 58
5 Massachusetts Institute of Technology 34% 37 : 63
6 Harvard University 26% None
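
The question also asks for a DataFrame. A minimal sketch of that step, assuming pandas is installed and that the `data` list in the JSON response is a list of flat dictionaries containing the four keys used above. `records_to_frame` and `fetch_rankings` are hypothetical helper names chosen for illustration, not part of any library:

```python
import pandas as pd
import requests

# URL copied from the answer above; the hashed file name may change over time.
URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

# Keys used in the answer above; other keys in the JSON are not assumed here.
COLUMNS = ["rank", "name", "stats_pc_intl_students", "stats_female_male_ratio"]


def records_to_frame(records, columns=COLUMNS):
    """Build a DataFrame from a list of dicts, keeping only `columns`."""
    frame = pd.DataFrame(records)
    # reindex fixes the column order and fills absent keys with NaN
    return frame.reindex(columns=columns)


def fetch_rankings(url=URL):
    """Download the JSON file and return its 'data' list as a DataFrame."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail loudly on HTTP errors
    return records_to_frame(res.json()["data"])
```

Calling `fetch_rankings()` performs the download, while `records_to_frame` can be tried offline on any list of dictionaries.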

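The printed values are strings such as `38%` and `46 : 54`, and one ratio is `None`. If numeric values are wanted, a small standard-library sketch could convert them. It assumes every value follows the formats visible in the output above, which has not been checked against the full file. `parse_percent` and `parse_ratio` are hypothetical helper names. The rank is left as a string because ties appear as `=3` and other rank formats are not assumed here:

```python
def parse_percent(value):
    """Turn a string like '38%' into 38.0; pass None through unchanged."""
    if value is None:
        return None
    return float(value.strip().rstrip("%"))


def parse_ratio(value):
    """Turn a string like '46 : 54' into the tuple (46, 54); pass None through."""
    if value is None:
        return None
    # int() tolerates the spaces left around each number after splitting
    left, right = (int(part) for part in value.split(":"))
    return left, right
```

These helpers can be applied to each item inside the loop before printing or storing the values.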

Archived:	7 years, 10 months ago
Views:	676
Last activity:	7 years, 10 months ago