chh*_*ing 5 playwright playwright-python
When you visit this link, the page runs some JavaScript and then automatically redirects to a PDF. I'm having trouble getting the final URL from Playwright.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://scnv.io/760y", wait_until="networkidle")
    print(page.url)
    page.close()
Is there a way to get the final URL?
There are several ways to do this. One is to use page.expect_response:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    # Catch any response whose URL ends in '.pdf'
    with page.expect_response('**/*.pdf') as response:
        page.goto("https://scnv.io/760y")
    print(response.value.url)
    page.close()
Output
https://qcg-media.s3.amazonaws.com/media/uploads/72778/2022/06/20220622_663043_221.pdf
See this part of the documentation, which covers handling network traffic in Playwright in detail.
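Also note that I did not include wait_until='networkidle', because it is not suitable for this use case. For that event to fire, the network must stay idle for at least 500 ms, which does not happen on this site while it is requesting the PDF. If you included it, the code would be inconsistent at best at capturing the URL we want.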
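As a variation on the approach above, page.expect_response also accepts a predicate instead of a glob pattern. The sketch below assumes you want to match on the URL path only, so that a PDF URL carrying a query string (which a glob like '**/*.pdf' may not match) is still caught. The helper names is_pdf_url and final_pdf_url and the timeout_ms parameter are hypothetical, introduced only for illustration.

```python
from urllib.parse import urlparse


def is_pdf_url(url: str) -> bool:
    """Return True if the URL's path ends in .pdf, ignoring any query string."""
    return urlparse(url).path.lower().endswith(".pdf")


def final_pdf_url(start_url: str, timeout_ms: float = 30_000) -> str:
    """Open start_url and return the URL of the first PDF response observed.

    Hypothetical helper: wraps the expect_response technique from the answer.
    """
    # Imported lazily so is_pdf_url can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        # Pass a predicate on the Response object instead of a glob pattern.
        with page.expect_response(
            lambda response: is_pdf_url(response.url), timeout=timeout_ms
        ) as response_info:
            page.goto(start_url)
        # .value is read after the with block, once the response has arrived.
        url = response_info.value.url
        browser.close()
        return url
```

Calling print(final_pdf_url("https://scnv.io/760y")) should then print the PDF URL, provided the site still redirects the same way. As in the answer, wait_until='networkidle' is deliberately left out.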