Daw*_*n17 3 python beautifulsoup web-scraping python-requests
I'm trying to parse the content of interview problems on Leetcode.
For example, on https://leetcode.com/problems/two-sum/,
I want to get:
Given an array of integers, return indices of the two numbers such that they add up to a specific target.
You may assume that each input would have exactly one solution, and you may not use the same element twice.
It doesn't look that hard. I'm using requests and BeautifulSoup to do it:
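import requests
from bs4 import BeautifulSoup

def fetch_page(url):
    try:
        # An explicit timeout ensures the timeout exceptions below can be raised
        page = requests.get(url, timeout=10)
    except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectTimeout):
        print('time out')
        return 'time out'
    soup = BeautifulSoup(page.content, 'html.parser')
    print(soup.prettify())

fetch_page('https://leetcode.com/problems/two-sum/')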
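However, the developer console (F12) shows that the page's response does not contain the content displayed on the page.
Is there a way to get this content?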
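You don't need Selenium. The page loads its dynamic content with a POST request: it sends a GraphQL query to the https://leetcode.com/graphql endpoint and gets JSON back. So it is much faster to do the following: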
import requests
from bs4 import BeautifulSoup as bs
data = {"operationName":"questionData","variables":{"titleSlug":"two-sum"},"query":"query questionData($titleSlug: String!) {\n question(titleSlug: $titleSlug) {\n questionId\n questionFrontendId\n boundTopicId\n title\n titleSlug\n content\n translatedTitle\n translatedContent\n isPaidOnly\n difficulty\n likes\n dislikes\n isLiked\n similarQuestions\n contributors {\n username\n profileUrl\n avatarUrl\n __typename\n }\n langToValidPlayground\n topicTags {\n name\n slug\n translatedName\n __typename\n }\n companyTagStats\n codeSnippets {\n lang\n langSlug\n code\n __typename\n }\n stats\n hints\n solution {\n id\n canSeeDetail\n __typename\n }\n status\n sampleTestCase\n metaData\n judgerAvailable\n judgeType\n mysqlSchemas\n enableRunCode\n enableTestMode\n envInfo\n libraryUrl\n __typename\n }\n}\n"}
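# POST the GraphQL payload as JSON and decode the JSON response
r = requests.post('https://leetcode.com/graphql', json=data).json()
# The problem statement comes back as an HTML fragment; the 'lxml' parser needs the lxml package installed
soup = bs(r['data']['question']['content'], 'lxml')
title = r['data']['question']['title']
question = soup.get_text().replace('\n', ' ')
print(title, '\n', question)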
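If only the title and problem statement are needed, the query can probably be cut down to those fields. The following is a rough sketch under some assumptions, not a tested client. It assumes the endpoint accepts a query naming only a subset of the fields used above, as GraphQL servers normally do. It also assumes that `question` or `content` may come back as null, for example for an unknown slug or a paid-only problem. The names `build_payload`, `extract_question` and `fetch_question` are made up for illustration. To keep the helpers dependency-free, the HTML is stripped with the standard library's `html.parser` rather than BeautifulSoup.

```python
from html.parser import HTMLParser

GRAPHQL_URL = "https://leetcode.com/graphql"

# Trimmed query: only fields that also appear in the full query above.
QUERY = """
query questionData($titleSlug: String!) {
  question(titleSlug: $titleSlug) {
    title
    content
    isPaidOnly
  }
}
"""


def build_payload(title_slug):
    """Build the JSON body for the POST request (hypothetical helper)."""
    return {
        "operationName": "questionData",
        "variables": {"titleSlug": title_slug},
        "query": QUERY,
    }


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_question(response_json):
    """Return (title, plain_text) from the decoded JSON response.

    Raises ValueError when the question or its content is missing
    (assumed to happen for unknown slugs or paid-only problems).
    """
    question = (response_json.get("data") or {}).get("question")
    if question is None:
        raise ValueError("no question in response")
    content = question.get("content")
    if content is None:
        raise ValueError("question has no content")
    collector = _TextCollector()
    collector.feed(content)
    collector.close()
    # Collapse newlines and repeated spaces into single spaces.
    text = " ".join("".join(collector.parts).split())
    return question["title"], text


def fetch_question(title_slug, timeout=10):
    """POST the query and return (title, plain_text). Needs network access."""
    import requests  # third-party; imported here so the helpers above work without it

    response = requests.post(GRAPHQL_URL, json=build_payload(title_slug), timeout=timeout)
    response.raise_for_status()
    return extract_question(response.json())
```

Calling `fetch_question("two-sum")` should then return the title and the statement as plain text, provided the endpoint still behaves as described in the answer.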