0 python parsing beautifulsoup html-parsing web-scraping
I'm trying to get the whole table (all 1000+ universities) from this website: https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/scores
To do this I'm using the requests and BeautifulSoup libraries. My code is:
import requests
from bs4 import BeautifulSoup

html_content = requests.get('https://www.timeshighereducation.com/world-university-rankings/2018/world-ranking#!/page/0/length/25/sort_by/rank/sort_order/asc/cols/stats')
# Parse the response body (.text), not the Response object itself
soup = BeautifulSoup(html_content.text, 'lxml')
Then I look for the table:
table = soup.find_all('table')[0]
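But the result doesn't contain the table itself: there is no <tbody>, no rows (<tr>), and no cells (<td>).
HTML code: [screenshot not preserved in the source]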

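Could you please help me get all the information from this site and build a DataFrame from it?
Try the approach below. The table rows are not in the HTML that requests downloads; the page loads them separately as JSON. If you look at the network activity in the XHR section of the Network tab in devtools, you can find the URL of that request. This is what your script could look like when it reads the data from the JSON response: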
import requests

# JSON file the page requests via XHR to fill the table
URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

res = requests.get(URL)
# Each entry under 'data' is one university
for items in res.json()['data']:
    rank = items['rank']
    name = items['name']
    intstudents = items['stats_pc_intl_students']
    ratio = items['stats_female_male_ratio']
    print(rank, name, intstudents, ratio)
Output:
1 University of Oxford 38% 46 : 54
2 University of Cambridge 35% 45 : 55
=3 California Institute of Technology 27% 31 : 69
=3 Stanford University 22% 42 : 58
5 Massachusetts Institute of Technology 34% 37 : 63
6 Harvard University 26% None
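Since the question asks for a DataFrame, here is a minimal sketch of how the same JSON records might be loaded into pandas. It assumes pandas is installed and that the URL and key names from the script above are still valid; the helper names `fetch_records` and `records_to_dataframe` and the `COLUMNS` list are hypothetical, chosen only for illustration.

```python
import pandas as pd
import requests

# URL copied from the script above; the file name may change over time (assumption).
URL = "https://www.timeshighereducation.com/sites/default/files/the_data_rankings/world_university_rankings_2018_limit0_369a9045a203e176392b9fb8f8c1cb2a.json"

# Keys used in the script above; other keys in the JSON are dropped.
COLUMNS = ["rank", "name", "stats_pc_intl_students", "stats_female_male_ratio"]


def fetch_records(url=URL):
    """Download the JSON file and return the list stored under 'data'."""
    res = requests.get(url, timeout=30)
    res.raise_for_status()  # fail early on HTTP errors
    return res.json()["data"]


def records_to_dataframe(records, columns=COLUMNS):
    """Build a DataFrame from a list of dicts, keeping only the chosen columns."""
    df = pd.DataFrame(records)
    # reindex keeps the column order and fills any missing column with NaN
    return df.reindex(columns=columns)
```

Calling `records_to_dataframe(fetch_records())` should then give one row per university. Missing values, such as the ratio shown as None in the output above, are kept as missing values in the DataFrame.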