Parsing LeetCode problem content with requests and BeautifulSoup

Question

Daw*_*n17 3 python beautifulsoup web-scraping python-requests

I am trying to parse the content of interview problems on LeetCode.

For example, from https://leetcode.com/problems/two-sum/ I want to get:

Given an array of integers, return indices of the two numbers such that they add up to a specific target.

You may assume that each input would have exactly one solution, and you may not use the same element twice.


That did not look too hard, so I used requests and BeautifulSoup:

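    import requests
    from bs4 import BeautifulSoup

    def fetch_page():  # wrapper added so that the `return` below is valid
        url = 'https://leetcode.com/graphql/two-sum'
        try:
            # Without an explicit timeout, requests waits indefinitely by default;
            # 10 seconds is an arbitrary choice.
            page = requests.get(url, timeout=10)
        except (requests.exceptions.ReadTimeout, requests.exceptions.ConnectTimeout):
            print('time out')
            return 'time out'

        soup = BeautifulSoup(page.content, 'html.parser')
        print(soup.prettify())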


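However, inspecting the page's response in the developer console (F12) shows that it does not contain the content displayed on the page.

Is there a way to get this content?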
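
One quick way to confirm this observation is to check whether a known phrase from the problem statement appears in the visible text of the raw HTML. The sketch below is only illustrative: `phrase_in_static_html` is a hypothetical helper name, and the two HTML strings are made-up stand-ins for a real response body such as `requests.get(url).content`.

```python
from bs4 import BeautifulSoup


def phrase_in_static_html(html, phrase):
    """Return True if `phrase` occurs in the visible text of `html`.

    Hypothetical helper for illustration; whitespace is collapsed and the
    comparison is case-insensitive.
    """
    soup = BeautifulSoup(html, 'html.parser')
    # Script and style bodies are not visible text, so drop them first.
    for tag in soup(['script', 'style']):
        tag.decompose()
    text = ' '.join(soup.get_text(separator=' ').split())
    return ' '.join(phrase.split()).lower() in text.lower()


# Made-up stand-ins for a server-rendered page and a JavaScript-rendered one.
static_page = '<html><body><p>Given an array of integers</p></body></html>'
js_page = '<html><body><div id="app"></div><script>render()</script></body></html>'

print(phrase_in_static_html(static_page, 'given an array'))
print(phrase_in_static_html(js_page, 'given an array'))
```

If the check fails for a page that clearly shows the phrase in the browser, the text is most likely loaded by JavaScript after the initial response.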

Answer 1

QHa*_*arr 6

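You don't need Selenium. The page fetches its dynamic content with a POST request: in essence, it sends a GraphQL query (not a MySQL query) to the site's `/graphql` endpoint, which returns the data as JSON. So it is much faster to send that request yourself: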

import requests
from bs4 import BeautifulSoup as bs

data = {"operationName":"questionData","variables":{"titleSlug":"two-sum"},"query":"query questionData($titleSlug: String!) {\n  question(titleSlug: $titleSlug) {\n    questionId\n    questionFrontendId\n    boundTopicId\n    title\n    titleSlug\n    content\n    translatedTitle\n    translatedContent\n    isPaidOnly\n    difficulty\n    likes\n    dislikes\n    isLiked\n    similarQuestions\n    contributors {\n      username\n      profileUrl\n      avatarUrl\n      __typename\n    }\n    langToValidPlayground\n    topicTags {\n      name\n      slug\n      translatedName\n      __typename\n    }\n    companyTagStats\n    codeSnippets {\n      lang\n      langSlug\n      code\n      __typename\n    }\n    stats\n    hints\n    solution {\n      id\n      canSeeDetail\n      __typename\n    }\n    status\n    sampleTestCase\n    metaData\n    judgerAvailable\n    judgeType\n    mysqlSchemas\n    enableRunCode\n    enableTestMode\n    envInfo\n    libraryUrl\n    __typename\n  }\n}\n"}

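# POST the GraphQL query and decode the JSON response
r = requests.post('https://leetcode.com/graphql', json=data).json()
# The 'content' field holds the problem statement as HTML
soup = bs(r['data']['question']['content'], 'lxml')  # the 'lxml' parser requires the lxml package
title = r['data']['question']['title']
question = soup.get_text().replace('\n', ' ')
print(title, '\n', question)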
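
Since only `title` and `content` are read from the response, the query can probably be trimmed to those two fields. The following is a minimal sketch under that assumption, not a verified recipe: it assumes the endpoint still accepts this query shape without authentication, and the names `build_payload`, `extract_question`, and `fetch_question` are hypothetical helpers introduced here. It uses the built-in `html.parser` instead of `lxml` to avoid the extra dependency.

```python
import requests
from bs4 import BeautifulSoup

GRAPHQL_URL = 'https://leetcode.com/graphql'  # endpoint taken from the answer

# Trimmed query (assumption): only the two fields that are actually read.
QUERY = """
query questionData($titleSlug: String!) {
  question(titleSlug: $titleSlug) {
    title
    content
  }
}
"""


def build_payload(title_slug):
    """Build the JSON body for the POST request."""
    return {
        'operationName': 'questionData',
        'variables': {'titleSlug': title_slug},
        'query': QUERY,
    }


def extract_question(response_json):
    """Pull the title and plain-text body out of a decoded response."""
    question = response_json['data']['question']
    soup = BeautifulSoup(question['content'], 'html.parser')
    # Collapse all runs of whitespace, including newlines, to single spaces.
    text = ' '.join(soup.get_text().split())
    return question['title'], text


def fetch_question(title_slug, timeout=10):
    """POST the query and return (title, text). Requires network access."""
    response = requests.post(GRAPHQL_URL, json=build_payload(title_slug),
                             timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_question(response.json())
```

Calling `title, text = fetch_question('two-sum')` would perform the request. Keeping the parsing in `extract_question` separate from the network call makes it possible to test the parsing against a saved or hand-written response.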

