Tag: python-requests-html

How to get the original HTML with absolute link paths when using "requests-html"

When I make a request to https://stackoverflow.com with the requests library:

import requests

page = requests.get(url='https://stackoverflow.com')
print(page.content)


I get the following:

<!DOCTYPE html>
    <html class="html__responsive html__unpinned-leftnav">
    <head>
        <title>Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</title>
        <link rel="shortcut icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/favicon.ico?v=ec617d715196">
        <link rel="apple-touch-icon" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a">
        <link rel="image_src" href="https://cdn.sstatic.net/Sites/stackoverflow/Img/apple-touch-icon.png?v=c78bd457575a"> 
..........


The source here has absolute paths, but when I request the same URL using requests-html with JS rendering:

from requests_html import HTMLSession

with HTMLSession() as session:
    page = session.get('https://stackoverflow.com')
    page.html.render()
    print(page.content)


I get the following:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>StackOverflow.org</title>
<script type="text/javascript" src="lib/jquery.js"></script>
<script type="text/javascript" src="lib/interface.js"></script>
<script type="text/javascript" src="lib/window.js"></script>
<link href="lib/dock.css" rel="stylesheet" …

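One way to turn relative paths such as lib/jquery.js into absolute ones is to resolve them against the page URL. The following is a minimal sketch using only the standard library; the `LinkCollector` class and the https://example.com/ base URL are assumptions for illustration and are not part of requests-html.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href/src attribute values resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag.
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(urljoin(self.base_url, value))


# Sample markup modelled on the relative paths in the output above.
html = '<script src="lib/jquery.js"></script><link href="lib/dock.css" rel="stylesheet">'
collector = LinkCollector("https://example.com/")  # assumed base URL
collector.feed(html)
print(collector.links)
```

In practice the base URL would be the final URL of the response, so that relative paths resolve against the page that was actually served.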

python python-3.x python-requests python-requests-html

Mez*_*ezo

2020-12-24

Score: 5 | Answers: 1 | Views: 1906

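Python requests: running a JS file from a GET

Goal

Log in to this website (https://www.reliant.com) using python requests etc. (I know this could be done with selenium or PhantomJS or something else, but I would prefer not to.)

Problem

During the login process there are several redirects in which "session ID"-type parameters are passed. I can get most of them, but one, dtPC, appears to come from a cookie that is set when you first visit the page. As far as I can tell, the cookie originates from this JS file (https://www.reliant.com/ruxitagentjs_ICA2QSVfhjqrux_10175190917092722.js). That URL is the next GET request the browser performs after the initial GET of the main URL. Nothing I have tried so far has gotten me that cookie.

Code so far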

from requests_html import HTMLSession

url=r'https://www.reliant.com'
url2=r'https://www.reliant.com/ruxitagentjs_ICA2QSVfhjqrux_10175190917092722.js'
headers={
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
 'Accept-Encoding': 'gzip, deflate, br',
 'Accept-Language': 'en-US,en;q=0.9',
 'Cache-Control': 'max-age=0',
 'Connection': 'keep-alive',
 'Host': 'www.reliant.com',
 'Sec-Fetch-Mode': 'navigate',
 'Sec-Fetch-Site': 'none',
 'Sec-Fetch-User': '?1',
 'Upgrade-Insecure-Requests': '1',
 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.3'
}

headers2={
'Referer': 'https://www.reliant.com',
 'Sec-Fetch-Mode': 'no-cors',
 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; …


javascript python authentication python-requests python-requests-html

Sup*_*tew


Score: 4 | Answers: 1 | Views: 4861

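Python requests response 403 Forbidden

I am trying to scrape this website: https://www.auto24.ee. I was able to scrape data from it without any problems, but today it gives me "Response 403". I tried using proxies and passing more information in the headers, but unfortunately nothing seems to work. I could not find a solution on the internet, and I have tried several different approaches. This is the code that previously ran without any problems: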

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.93 Safari/537.36',
}

page = requests.get("https://www.auto24.ee/", headers=headers)

print(page)


python http-status-code-403 python-requests python-requests-html

Ice*_*amz


Score: 4 | Answers: 1 | Views: 10k

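requests-html package does not render fast.com correctly

I am developing a web scraping application with Python 3.7 and using requests-html to parse the data. So far I have tried the following code, which uses the render function (because the speed data on fast.com is loaded via JavaScript).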

from requests_html import HTMLSession
quote_page = 'https://fast.com'
session = HTMLSession()
r = session.get(quote_page)
r.html.render()
extract_value = r.html.find('#speed-value', first=True)
print(extract_value.text)


speed-value is the id attribute of the div that contains the speed data.

But it still prints the speed value as 0.

python-3.x python-requests-html

roh*_*wal

2019-09-02

Score: 3 | Answers: 1 | Views: 909

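How to asynchronously get() a list of URLs with requests_html?

I am trying to asynchronously get() a list of URLs using the Python package requests_html, similar to the async example in the README, with Python 3.6.5 and requests_html 0.10.0.

My understanding is that AsyncHTMLSession.run() should work much like asyncio.gather(): you give it a bunch of awaitables and it runs them all. Is that incorrect?

Here is the code I am trying, which I expect to fetch the pages and store the responses: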

from requests_html import AsyncHTMLSession

async def get_link(url):
    r = await asession.get(url)
    return r

asession = AsyncHTMLSession()
results = asession.run(get_link("http://google.com"), get_link("http://yahoo.com"))


But instead I get this exception:

Traceback (most recent call last):
  File "test.py", line 10, in <module>
    results = asession.run(get_link("google.com"), get_link("yahoo.com"))
  File ".\venv\lib\site-packages\requests_html.py", line 772, in run
    asyncio.ensure_future(coro()) for coro in coros
  File ".\venv\lib\site-packages\requests_html.py", line 772, in <listcomp>
    asyncio.ensure_future(coro()) for coro in coros
TypeError: …

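The traceback shows run() calling `coro()` on each argument, which suggests it expects callables that return awaitables rather than already-created coroutine objects. Below is a minimal sketch of that call pattern using only asyncio. The `run` function is a hypothetical stand-in modelled on the traceback line, not the requests_html implementation, and `get_link` fakes the request so that no network access is needed.

```python
import asyncio
from functools import partial


async def get_link(url):
    # Stand-in for `await asession.get(url)`; no network access here.
    await asyncio.sleep(0)
    return f"fetched {url}"


def run(*coros):
    # Hypothetical mimic of the pattern in the traceback: every argument
    # is *called* first, so it must be a zero-argument callable that
    # returns an awaitable, not a coroutine object.
    async def _gather():
        tasks = [asyncio.ensure_future(coro()) for coro in coros]
        return await asyncio.gather(*tasks)

    return asyncio.run(_gather())


urls = ["http://google.com", "http://yahoo.com"]
# partial(get_link, url) binds the URL now and defers the call to run().
results = run(*(partial(get_link, url) for url in urls))
print(results)
```

`functools.partial` is used instead of a lambda inside the loop because it binds each URL at creation time and so avoids the late-binding pitfall of closures.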

python asynchronous python-3.x python-asyncio python-requests-html

ets*_*ner

2019-11-15

Score: 2 | Answers: 1 | Views: 3333

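Python Requests library - how to ensure HTTPS requests

This may be a silly question, but I just want to confirm the following.

I am currently using the requests library in Python to call an external API hosted on the Azure cloud.

If I use the requests library from a virtual machine and it sends a request to the URL https://api-management-example/run, does that mean my communication with this API, and the entire payload I send, is secure? I see a cacert.pem file in the Python site-packages of my virtual environment. Do I need to update it? Do I need to do anything else to secure the communication, or does calling an HTTPS URL already mean it is secure?

Any information/guidance would be much appreciated.

Thanks,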

python python-requests python-requests-html

ada*_*n11


Score: 2 | Answers: 1 | Views: 7706

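Limiting the number of concurrent AsyncIO tasks with a semaphore does not work

Objective:

I am trying to scrape multiple URLs concurrently. I do not want to make too many requests at the same time, so I am using this solution to limit them.

Problem:

Requests are being made for all tasks at once, rather than for a limited number of tasks at a time.

Stripped-down code: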

async def download_all_product_information():
    # TO LIMIT THE NUMBER OF CONCURRENT REQUESTS
    async def gather_with_concurrency(n, *tasks):
        semaphore = asyncio.Semaphore(n)

        async def sem_task(task):
            async with semaphore:
                return await task

        return await asyncio.gather(*(sem_task(task) for task in tasks))

    # FUNCTION TO ACTUALLY DOWNLOAD INFO
    async def get_product_information(url_to_append):
        url = 'https://www.amazon.com.br' + url_to_append

        print('Product Information - Page ' + str(current_page_number) + ' for category ' + str(
            category_index) + '/' + str(len(all_categories)) + ' in …

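For comparison, here is a minimal self-contained sketch of the same semaphore pattern that does cap concurrency, under the assumption that plain coroutine objects (not already-scheduled Tasks) are passed in. A coroutine object does not start running until it is awaited, whereas a Task begins as soon as it is created and so bypasses the semaphore. `fake_download` is a hypothetical stand-in for the real request and only tracks how many calls are in flight.

```python
import asyncio


async def gather_with_concurrency(n, *coros):
    # Pass coroutine objects: each one starts only when it is awaited
    # inside the semaphore, so at most n run at any time.
    semaphore = asyncio.Semaphore(n)

    async def sem_task(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(sem_task(c) for c in coros))


running = 0  # downloads currently in flight
peak = 0     # highest value `running` ever reached


async def fake_download(i):
    # Hypothetical stand-in for a real request; records concurrency.
    global running, peak
    running += 1
    peak = max(peak, running)
    await asyncio.sleep(0.01)
    running -= 1
    return i


async def main():
    return await gather_with_concurrency(3, *(fake_download(i) for i in range(10)))


results = asyncio.run(main())
print(results, peak)
```

Whether this matches the cause in the question cannot be confirmed from the truncated code; it only illustrates the difference between awaiting coroutines lazily and scheduling them up front.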

python web-scraping python-asyncio python-requests-html

Jos*_*des

2021-12-23

Score: 2 | Answers: 1 | Views: 2838

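ImportError: cannot import name 'HTMLSession' from 'requests_html'

When I tried the example from the website of the new requests_html module, the console showed the message in the title.

I have successfully installed requests_html using pip install requests_html
I have updated Python to Python 3.7 (64-bit)

Console output: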

Traceback (most recent call last):
  File "C:/Users/owlish/PycharmProjects/python34/requests.py", line 2, in <module>
    from requests_html import HTMLSession
  File "C:\Users\owlish\AppData\Local\Programs\Python\Python37\lib\site-packages\requests_html.py", line 10, in <module>
    import requests
  File "C:\Users\owlish\PycharmProjects\python34\requests.py", line 2, in <module>
    from requests_html import HTMLSession
ImportError: cannot import name 'HTMLSession' from 'requests_html' (C:\Users\owlish\AppData\Local\Programs\Python\Python37\lib\site-packages\requests_html.py)


Code:

from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://python.org/')


I expected it to work just like the example at https://html.python-requests.org/.
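The traceback suggests that the script itself is named requests.py, so the `import requests` inside requests_html resolves to the script instead of the library and the import becomes circular. A minimal sketch of that check follows; the helper name `shadowed_imports` is hypothetical, and the module list is taken from the traceback.

```python
from pathlib import PureWindowsPath


def shadowed_imports(script_path, imported_names):
    # The script's own directory is searched first, so a script whose
    # file name equals a module it imports (directly or indirectly)
    # is picked up in place of the installed module.
    stem = PureWindowsPath(script_path).stem
    return [name for name in imported_names if name == stem]


# Script path and module names as they appear in the traceback.
script = "C:/Users/owlish/PycharmProjects/python34/requests.py"
print(shadowed_imports(script, ["requests_html", "requests"]))
```

If the list is non-empty, renaming the script to something that does not collide with an installed package is the usual way out.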

python-3.x python-requests-html

叶小白*_*叶小白


Score: 1 | Answers: 1 | Views: 10k

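Is it possible to get the client's IP address from the HTTP request headers?

I would like to know whether it is possible to get the client's IP address from the HTTP request headers using Python. I am working on a weather project, and it would be great to show users the weather for their own location.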
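As a rough illustration: the client address normally comes from the network connection rather than from a header, but proxies commonly add an X-Forwarded-For header listing the original client first. The sketch below assumes the headers are available as a plain dict and that the server framework supplies the connection's peer address; the function name `client_ip` and the sample addresses are hypothetical.

```python
def client_ip(headers, remote_addr):
    # X-Forwarded-For is added by proxies and can be forged by the
    # client, so treat it as a hint; fall back to the peer address.
    forwarded = headers.get("X-Forwarded-For", "")
    first = forwarded.split(",")[0].strip()
    return first or remote_addr


# Behind a proxy: the first entry is the original client.
print(client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.1"}, "10.0.0.1"))
# Direct connection: no header, so the peer address is used.
print(client_ip({}, "198.51.100.23"))
```

A plain dict lookup is case-sensitive, while HTTP header names are not, so a real framework's header object may behave differently from this sketch.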

python http request-headers python-requests python-requests-html

作者

2020-05-18

Score: 1 | Answers: 1 | Views: 9216

How to track page redirects using requests

I have this simple code:

import requests
r = requests.get('https://yahoo.com')
print(r.url)


When executed, it prints:

https://uk.yahoo.com/?p=us


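I would like to see:

How many redirects happened before reaching https://uk.yahoo.com/?p=us (obviously there was a redirect, since I originally entered https://yahoo.com)?
I would also like to save the content of every page, not just the last one. How can I do this?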
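A requests response keeps the intermediate responses in its `history` attribute, oldest first. The sketch below walks that list and appends the final response. To stay runnable without network access it uses fake response objects built with `SimpleNamespace`; the helper name `redirect_chain` and the status codes and bodies are assumptions for illustration.

```python
from types import SimpleNamespace


def redirect_chain(response):
    # `history` holds the redirect responses in order; the final
    # response is not part of it, so add it at the end.
    hops = list(response.history) + [response]
    return [(r.status_code, r.url, r.content) for r in hops]


# Fake responses standing in for what requests would return (assumed values).
first = SimpleNamespace(status_code=301, url="https://yahoo.com/",
                        content=b"", history=[])
final = SimpleNamespace(status_code=200, url="https://uk.yahoo.com/?p=us",
                        content=b"<html>...</html>", history=[first])

chain = redirect_chain(final)
print(len(final.history), "redirect(s)")
for status, url, body in chain:
    print(status, url, len(body))
```

With the real library, passing the result of `requests.get('https://yahoo.com')` to `redirect_chain` would give one tuple per hop, and each body could then be written to its own file.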

python web-scraping python-3.x python-requests python-requests-html

use*_*654

2019-02-28

Score: -1 | Answers: 1 | Views: 7050

Tag statistics

python-requests-html ×10

python ×8

python-requests ×6

python-3.x ×5

python-asyncio ×2

web-scraping ×2

asynchronous ×1

authentication ×1

http ×1

http-status-code-403 ×1

javascript ×1

request-headers ×1
