Wait for the page to load before getting data with requests.get in Python 3

Question

rib*_*bas 11 beautifulsoup web-scraping python-3.x python-requests

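I have a page whose source I need to parse with BS4, but the middle of the page takes about 1 second (maybe less) to load its content, and requests.get captures the page source before that section has loaded. How can I wait a second before getting the data?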

import requests
from bs4 import BeautifulSoup

r = requests.get(URL + self.search, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('section', 'wrapper')

The page:

<section class="wrapper" id="resultado_busca">

Answer 1

Eno*_*och 21

I had the same problem, and none of the submitted answers really worked for me. After a long search, I found a solution:

from requests_html import HTMLSession
s = HTMLSession()
response = s.get(url)
response.html.render()

print(response.html.html)
# prints the HTML of the rendered page (print(response) only shows the response object)
# response.html.html can then be parsed with, for example, bs4

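The requests_html package (docs) is hosted under the Python Software Foundation's GitHub organization. It adds JavaScript support on top of requests, such as the ability to wait for the page's JavaScript to finish loading.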

The package currently supports only Python 3.6 and above, so it may not work with other versions.

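@IbtsamCh Yes! There are two ways: the `wait` parameter of render adds a wait before the JavaScript is rendered, and the `sleep` parameter adds a wait after the JS has rendered. Both take a number of seconds. Example: `response.html.render(wait=2, sleep=3)` waits 2 seconds before the JavaScript is rendered and 3 seconds after. (6 upvotes)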

I had to do `print(response.text)` to actually print anything (2 upvotes)

Answer 2

Vin*_*iar 19

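This doesn't look like a waiting problem. It looks like the element is created by JavaScript, and requests can't handle elements that are generated dynamically by JavaScript. My suggestion is to use selenium together with PhantomJS to get the page source, and then use BeautifulSoup for parsing. The code below does exactly that: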

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://legendas.tv/busca/walking%20dead%20s03e02"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('section', 'wrapper')

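Also, there's no need to use .findAll if you are only looking for one element.

Update: Selenium's support for PhantomJS has been deprecated; you should use a headless version of Chrome or Firefox. (7 upvotes)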

Answer 3

Zuk*_*uku 14

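Selenium is a good way to solve this, but the accepted answer is quite outdated. As @Seth mentioned in the comments, the headless mode of Firefox/Chrome (or possibly other browsers) should be used instead of PhantomJS.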

First, you need to download the driver for your browser:
Geckodriver for Firefox
ChromeDriver for Chrome

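Next, you can add the path of the downloaded driver to your system PATH variable. This isn't required, though: you can also specify the location of the executable in code.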

Firefox:

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()

The same goes for Chrome:

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Chrome(options=options, executable_path='YOUR_PATH/chromedriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()

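It's good to remember to call browser.quit() so that no processes are left hanging after the code has run. If you're worried that your code might fail before the browser is released, you can wrap it in a try...except block and put browser.quit() in the finally section to make sure it gets called.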
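
As a minimal sketch of that cleanup pattern, the try/finally can be wrapped in a small context manager so that quit() runs even when the scraping code raises. The names `managed_browser`, `factory` and `DummyBrowser` are hypothetical and only for illustration; `DummyBrowser` stands in for a real driver so the pattern can be tried without a browser installed.

```python
from contextlib import contextmanager


@contextmanager
def managed_browser(factory):
    """Create a browser with factory() and always call quit() on exit.

    `factory` is any zero-argument callable that returns an object with a
    quit() method, for example: lambda: webdriver.Firefox(options=options)
    """
    browser = factory()
    try:
        yield browser
    finally:
        # Runs on normal exit and when the body raises.
        browser.quit()


class DummyBrowser:
    """Stand-in for a real driver; it only records whether quit() ran."""

    def __init__(self):
        self.quit_called = False

    def quit(self):
        self.quit_called = True
```

With a real driver the body of the `with` block would hold the browser.get(...) and BeautifulSoup calls, and the finally clause would release the browser process whether or not they succeed.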

Additionally, if part of the source is still not loaded with this approach, you can ask selenium to wait until a specific element appears:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')

try:
    browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
    timeout_in_seconds = 10
    WebDriverWait(browser, timeout_in_seconds).until(ec.presence_of_element_located((By.ID, 'resultado_busca')))
    html = browser.page_source
    soup = BeautifulSoup(html, features="html.parser")
    print(soup)
except TimeoutException:
    print("I give up...")
finally:
    browser.quit()

If you're interested in drivers other than Firefox or Chrome, check the docs.
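
To make the idea behind the explicit wait more concrete, here is a rough sketch of a polling loop: call a condition repeatedly until it returns something truthy or a deadline passes. This is not Selenium's implementation, only an illustration of the technique; `wait_until` and its parameters are hypothetical names, and the 0.5 second default interval is chosen to mirror WebDriverWait's default poll frequency.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value, then return it.

    Raises TimeoutError if no truthy value is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

The useful property is that the loop returns as soon as the condition holds, so a 10 second timeout costs nothing extra when the element appears after one second. A fixed time.sleep(10) would always pay the full delay.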

Answer 4

Ing*_*wan 5

In Python 3, I found that the urllib module actually worked better than the requests module for loading dynamic web pages.

For example:

import urllib.error
import urllib.request

try:
    with urllib.request.urlopen(url) as response:
        # use whatever encoding the web page declares
        html = response.read().decode('utf-8')
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(f"{url} is not found")
    elif e.code == 503:
        print(f"{url} base webservices are not available")
        # authentication could be added here
    else:
        print('http error', e)

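It made no difference for me. I got a 200 response with the HTML skeleton, but the main div wasn't populated with the data it has when the page is viewed in a web browser. (3 upvotes)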
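
The behaviour described in this comment can be reproduced offline. The sketch below assumes a made-up page served from a throwaway local server: the page contains an empty section plus a script that would fill it in a browser. The names `PAGE`, `PageHandler` and `fetch_raw_html` are hypothetical. Fetching the page with urllib returns the markup exactly as the server sent it, because nothing executes the script.

```python
import http.server
import threading
import urllib.request

# Made-up page: the section is empty until a browser runs the script.
PAGE = b"""<html><body>
<section class="wrapper" id="resultado_busca"></section>
<script>
  document.getElementById("resultado_busca").innerHTML = "<p>filled in by JS</p>";
</script>
</body></html>"""


class PageHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_raw_html():
    """Serve PAGE on a free local port, fetch it once, return (status, html)."""
    server = http.server.HTTPServer(("127.0.0.1", 0), PageHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler keeps the request from going through a system proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        with opener.open(url, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()
```

The status is 200 and the section comes back empty, which matches the "skeleton" described above. The same would be expected from any client that only downloads the HTML, so getting the populated section needs something that runs the JavaScript.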

Answer 5

she*_*der 5

I found a way!!!

r = requests.get('https://github.com', timeout=(3.05, 27))

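Here the timeout has two values: the first is the connect timeout and the second, the read timeout, is the one you need. The second sets how many seconds the client waits for the server to send the response. You can work out how long the content takes to populate and then print the data.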

Setting timeout=None worked for me. https://requests.readthedocs.io/en/master/user/advanced/#timeouts (5 upvotes)
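
As a rough illustration of what a read timeout does, the sketch below uses urllib from the standard library as a stand-in for requests (an assumption made so the example runs without third-party packages) and a made-up local server that takes half a second to answer. `SlowHandler`, `fetch`, `compare_timeouts` and `SERVER_DELAY` are hypothetical names. The timeout acts as an upper bound on how long the client waits: a value that is too small makes the call fail, and a large one does not make the call wait any longer than the server needs.

```python
import http.server
import threading
import time
import urllib.request

SERVER_DELAY = 0.5  # seconds the stand-in server takes to answer


class SlowHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(SERVER_DELAY)
        body = b"<p>done</p>"
        try:
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except OSError:
            pass  # the client already gave up and closed the connection

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch(opener, url, timeout):
    """Return (body or None on failure, elapsed seconds)."""
    start = time.monotonic()
    try:
        with opener.open(url, timeout=timeout) as response:
            body = response.read().decode("utf-8")
    except OSError:  # covers socket timeouts and urllib.error.URLError
        body = None
    return body, time.monotonic() - start


def compare_timeouts():
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # An empty ProxyHandler keeps the request from going through a system proxy.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        too_short = fetch(opener, url, timeout=0.1)  # gives up early
        generous = fetch(opener, url, timeout=30)    # returns once the server answers
    finally:
        server.shutdown()
        server.server_close()
    return too_short, generous
```

In this sketch the 30 second timeout still returns in roughly half a second, so a longer timeout by itself would not be expected to give JavaScript on the page time to run.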

Archived:	8 years, 3 months ago
Views:	18678
Last activity:	7 years, 1 month ago