SIM*_*SIM 9 python web-scraping python-3.x python-requests python-requests-html
I've written a script in Python to fetch the price of the latest transaction from a JavaScript-rendered webpage. I can get the content if I go with selenium. My goal, however, is not to use any browser simulator like selenium, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-rendered content. I haven't been able to pull it off, though: when I run the script I get the error below. Any help with this would be highly appreciated.

Site address: webpage_link

The script I've tried:
import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs', first=True).text
    print(item)
This is the full traceback:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\experiment.py", line 6, in <module>
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\shutil.py", line 387, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\ar\\.pyppeteer\\.dev_profile\\tmp1gng46sw\\CrashpadMetrics-active.pma'
The price I'm after is at the top of the page, shown as 177.59 EUR Last trade price. I'm hoping to get 177.59, or whatever the current price is.
Mar*_*ers 14
You have a couple of errors. The first is a navigation timeout, showing that the page didn't complete rendering:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
This traceback is not raised in the main thread, so your code was not aborted. Your page may or may not be complete; you may want to set a longer timeout or introduce a sleep cycle for the browser to have time to process the AJAX responses.
Next, response.html.render() returns None. It loads the HTML into a headless Chromium browser, leaves the JavaScript rendering to that browser, then copies the page HTML back into the response.html data structure in place, so there is nothing that needs returning. js is therefore set to None, not a new HTML instance, causing your next traceback.
Use the existing response.html object to search, after rendering:
r.html.render()
item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
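Putting the fixes together, a minimal sketch of what the corrected flow could look like. The parse_price helper and fetch_last_price wrapper are my own illustrative names, not part of requests-html, and the price format ('177.59 EUR') is assumed from the question:

    import re

    def parse_price(text):
        """Extract the leading decimal number from text like '177.59 EUR'."""
        match = re.match(r'([\d,]+(?:\.\d+)?)', text.strip())
        if match is None:
            raise ValueError('no price found in {!r}'.format(text))
        return float(match.group(1).replace(',', ''))

    def fetch_last_price():
        """Not called here: needs a network connection and a Chromium download."""
        import requests_html
        with requests_html.HTMLSession() as session:
            r = session.get('https://www.gdax.com/trade/LTC-EUR')
            r.html.render(sleep=10)  # renders in place; returns None
            item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
            return parse_price(item.text) if item is not None else None

Note the None check before touching .text: as discussed above, the class-suffixed element may simply not be there yet.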
There is most likely no such CSS class anymore, because the last 5 characters are generated anew each time the page is rendered, after the JSON data has loaded over AJAX. That makes it hard to use CSS to find the element in question.
Moreover, I found that without a sleep cycle the browser has no time to fetch the AJAX resources and render the information you wanted loaded. Give it, say, 10 seconds of sleep before copying back the HTML. Set a longer timeout (the default is 8 seconds) if you see network timeouts:
r.html.render(timeout=10, sleep=10)
You can also set timeout to 0 to remove the timeout and wait indefinitely for the page to load. Hopefully a future API update will also provide a facility to wait for network activity to cease.
You can use the included parse library to find the matching CSS classes:
# search for CSS suffixes
suffixes = [r[0] for r in r.html.search_all('MarketInfo_market-num_{:w}')]
for suffix in suffixes:
    # for each suffix, find all matching elements with that class
    items = r.html.find('.MarketInfo_market-num_{}'.format(suffix))
    for item in items:
        print(item.text)
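The same suffix hunt can also be done with the stdlib re module if you prefer not to rely on the parse format syntax. A small sketch against a made-up HTML snippet (the markup and class name here are invented for illustration, mimicking the hashed class names on the page):

    import re

    html = ('<span class="MarketInfo_market-num_1lAXs">169.81 EUR</span>'
            '<span class="MarketInfo_market-num_1lAXs">+ 1.01 %</span>')

    # collect the distinct generated suffixes
    suffixes = set(re.findall(r'MarketInfo_market-num_(\w+)', html))

    # pull the text of every element carrying one of those classes
    texts = re.findall(r'class="MarketInfo_market-num_\w+">([^<]+)<', html)
    print(texts)

This only works for simple flat markup like the spans above; for anything nested you'd want a real HTML parser rather than regular expressions.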
Now we get output:
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
Your last traceback shows that the Chromium user data path could not be cleaned up. The underlying Pyppeteer library configures the headless Chromium browser with a temporary user data path, and in your case the directory contains some still-locked resource. You can ignore the error, although you may want to try to remove any remaining files in the .pyppeteer folder at a later time.
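That cleanup can be scripted. A sketch, assuming the profile directories live under ~/.pyppeteer/.dev_profile as in your traceback (the clean_dev_profiles name is mine), with ignore_errors so still-locked files don't abort the sweep:

    import shutil
    from pathlib import Path

    def clean_dev_profiles(base=None):
        """Remove leftover Chromium dev-profile temp dirs, skipping locked files."""
        base = Path(base) if base else Path.home() / '.pyppeteer' / '.dev_profile'
        removed = []
        if base.is_dir():
            for tmp in base.glob('tmp*'):
                shutil.rmtree(tmp, ignore_errors=True)
                if not tmp.exists():
                    removed.append(tmp.name)
        return removed

Run it after the browser has fully exited; while Chromium still holds the lock, rmtree will quietly leave the locked files in place.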