Scraping a table from the web with Python

0 python parsing beautifulsoup html-parsing web-scraping

I'm trying to get the whole table (all 1000+ universities) from this site - https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores

To do this, I'm using the requests and BeautifulSoup libraries. My code is:

import requests
from bs4 import BeautifulSoup

html_content = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
soup = BeautifulSoup(html_content.text, 'lxml')

Then I look for the table:

table = soup.find_all('table')[0]

But in the result I don't see the table contents themselves: no <tbody>, no rows <tr>, and no cells <td>.

HTML code:

Could you please help me get all the information from this site and build a DataFrame from it?

SIM*_*SIM 5

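Try the approach below. The table rows are not in the HTML that requests receives; the page loads them separately as JSON. If you look at the network activity under the XHR section of the Network tab in devtools, you can find the URL of that JSON file. Here is how your script could look when it reads the data from the JSON response: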

import requests

URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

res = requests.get(URL)
# Each entry under 'data' is a dict describing one university
for items in res.json()['data']:
    rank = items['rank']
    name = items['name']
    intstudents = items['stats_pc_intl_students']
    ratio = items['stats_female_male_ratio']
    print(rank, name, intstudents, ratio)

Output:

1 University of Oxford 38% 46 : 54
2 University of Cambridge 35% 45 : 55
=3 California Institute of Technology 27% 31 : 69
=3 Stanford University 22% 42 : 58
5 Massachusetts Institute of Technology 34% 37 : 63
6 Harvard University 26% None
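
Since the question also asks for a DataFrame, here is a minimal sketch of that last step, assuming pandas is installed. The helper name `records_to_frame` and the output column labels are hypothetical choices of mine, and the sample rows are copied from the output above (with `rank` assumed to be a string, since values like `=3` appear), so nothing is fetched from the network here.

```python
import pandas as pd

# JSON keys used in the answer's script, mapped to column labels.
# The labels on the right are illustrative, not part of the site's data.
COLUMNS = {
    "rank": "rank",
    "name": "name",
    "stats_pc_intl_students": "intl_students",
    "stats_female_male_ratio": "female_male_ratio",
}


def records_to_frame(records):
    """Build a DataFrame from the list of dicts stored under the JSON 'data' key."""
    frame = pd.DataFrame.from_records(records)
    # reindex keeps only the wanted keys (a missing key becomes a NaN column),
    # then rename applies the shorter labels.
    return frame.reindex(columns=list(COLUMNS)).rename(columns=COLUMNS)


# Sample rows taken from the printed output above.
sample = [
    {"rank": "1", "name": "University of Oxford",
     "stats_pc_intl_students": "38%", "stats_female_male_ratio": "46 : 54"},
    {"rank": "2", "name": "University of Cambridge",
     "stats_pc_intl_students": "35%", "stats_female_male_ratio": "45 : 55"},
    {"rank": "6", "name": "Harvard University",
     "stats_pc_intl_students": "26%", "stats_female_male_ratio": None},
]

df = records_to_frame(sample)
print(df)
```

With the real response, the same helper could be called as `records_to_frame(res.json()['data'])`, where `res` is the response object from the script above. The real records likely carry more keys than the four shown; the `reindex` step simply drops the rest.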