Sending a click with requests_html and pyppeteer in Python

Question


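I am trying to log in to a website, click a button, and then scrape some data. The page has to be rendered because it is built entirely with JavaScript (so the data is not available if you, for example, view the source in a web browser).

Everything works except for sending the click.

When I try to send the click with the requests_html package, it does not seem to do anything, although no error is thrown. I know it relies heavily on pyppeteer, so I have been jumping between the two sets of documentation, but the whole asynchronous programming thing is very confusing to me.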

import asyncio
import requests_html

# Login information
payload = {
    'email': 'example@gmail.com',
    'password': 'Password123'
}

# Start a session
with requests_html.HTMLSession() as s:
    p = s.post('https://www.website.com/login', data=payload)

    # Send the request now that we're logged in
    r = s.get('https://www.website.com/data')

    # Render the JavaScript page so it's accessible
    r.html.render(keep_page=True, scrolldown=5, sleep=5)

    async def click():
        await r.html.page.click(
                                selector='button.showAll', 
                                options={'delay':3, 'clickCount':1},              
                                )

    asyncio.get_event_loop().run_until_complete(click())

    print(r.html.html)


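r.html.html contains the rendered HTML from the JavaScript, but not the content that clicking the button should reveal. I have confirmed that the button is being clicked, but I suspect the new page is somehow not being "saved", so r.html.html returns the pre-click page.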
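One way to check that suspicion, as a minimal sketch: r.html.html is filled in when render() finishes, so after a later click the DOM would have to be read again from the live page. The helper name click_and_snapshot and the settle_seconds pause are assumptions made for illustration; page.click and page.content are pyppeteer Page methods.

```python
import asyncio


async def click_and_snapshot(page, selector, settle_seconds=1.0):
    """Click `selector` on a live pyppeteer page and return the HTML afterwards.

    `page` is assumed to be the object kept alive by
    r.html.render(keep_page=True), i.e. r.html.page.
    """
    # pyppeteer's click delay is given in milliseconds.
    await page.click(selector, {'delay': 3, 'clickCount': 1})
    # Assumption: a fixed pause is enough for the page's own JavaScript
    # to finish updating the DOM after the click.
    await asyncio.sleep(settle_seconds)
    # Read the DOM as it is now, not the copy cached in r.html.html.
    return await page.content()
```

With the code in the question, this could be called as `html_after = asyncio.get_event_loop().run_until_complete(click_and_snapshot(r.html.page, 'button.showAll'))`, and html_after compared with r.html.html.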

I would rather not use the deprecated PhantomJS/Selenium. Scrapy is really heavy, and I would rather not depend on Scrapy + Splash to get this done. I think I'm close! MechanicalSoup does not work with JavaScript.

Answer 1

小智 2

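According to the latest requests_html documentation, you can pass a script argument to the render method of the html object. This is equivalent to calling the evaluate method of the (pyppeteer) page attribute; see requests_html.py (line 523). For example (warning: quick and dirty code):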

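from requests_html import HTMLSession

# Illustrative values, not from the original answer; tune them for the target page
sleep = 5      # seconds to wait after the script has run
timeout = 20   # seconds allowed for the page to load

session = HTMLSession()
r = session.get("http://xy.com")

# JavaScript evaluated in the page: click the element with id "foo" if it exists
script = """
    () => {
        const item = document.getElementById("foo");
        if (item) {
            item.click()
        }
    }
"""

r.html.render(sleep=sleep, timeout=timeout, script=script)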
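Remember to provide a suitable sleep interval so that rendering has finished. I have tested this and the result was correct (when the button is clicked, the page makes an additional request to add more information, and I could find that information after applying the script).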
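The same idea can be wrapped so that the selector is a parameter and the caller learns whether the element was found. This is only a sketch: build_click_script and render_with_click are hypothetical helper names, the sleep and timeout defaults are arbitrary, and document.querySelector is swapped in for getElementById so that a CSS selector such as button.showAll can be used.

```python
import json


def build_click_script(selector):
    """Return a JavaScript arrow function (as text) that clicks `selector`.

    json.dumps is used so the selector is embedded as a properly
    quoted JavaScript string literal.
    """
    return (
        "() => {\n"
        f"    const item = document.querySelector({json.dumps(selector)});\n"
        "    if (item) { item.click(); return true; }\n"
        "    return false;\n"
        "}"
    )


def render_with_click(url, selector, sleep=5, timeout=20):
    """Fetch `url`, click `selector` during rendering, return (clicked, html)."""
    # Imported here so build_click_script stays usable without requests_html.
    from requests_html import HTMLSession

    with HTMLSession() as session:
        r = session.get(url)
        # render() hands back the value returned by the script (true or false here).
        clicked = r.html.render(script=build_click_script(selector),
                                sleep=sleep, timeout=timeout)
        return clicked, r.html.html
```

A falsy first element in the returned pair would suggest the selector did not match at the moment the script ran, which is a different failure from the click having no visible effect.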

Archived:	7 years, 4 months ago
Views:	1540
Last activity:	5 years, 4 months ago