I have a web-scraping script that works through thousands of links, but sometimes I get connection errors, timeout errors, or bad-gateway errors, and my script just stops.
Here is part of my code (urls holds the links I loop over):
def scrape(urls):
    browser = webdriver.Firefox()
    datatable = []
    for url in urls:
        browser.get(url)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})
I think I have to use a try/except approach to handle this, and retry reading the site when it happens.
My question is: where in my code do I have to put this, and what exactly do I write, so that these errors are caught and the script retries or moves on to the next link?
try:
    r = requests.get(url, params={'s': thing})
except requests.exceptions.RequestException:
    # what do I have to write here, and where exactly does this part belong?
Thanks!
When I've dealt with these kinds of errors before, I wrote a decorator that retries a function call a certain number of times if a given exception is raised.
from functools import wraps
import time
from requests.exceptions import RequestException
from socket import timeout
class Retry(object):
    """Decorator that retries a function call a number of times, optionally
    with particular exceptions triggering a retry, whereas unlisted exceptions
    are raised.

    :param pause: Number of seconds to pause before retrying
    :param retreat: Factor by which to extend pause time each retry
    :param max_pause: Maximum time to pause before retry. Overrides pause times
        calculated by retreat.
    :param cleanup: Function to run if all retries fail. Takes the same
        arguments as the decorated function.
    """
    def __init__(self, times, exceptions=(IndexError,), pause=1, retreat=1,
                 max_pause=None, cleanup=None):
        """Initialise all input params"""
        self.times = times
        self.exceptions = exceptions
        self.pause = pause
        self.retreat = retreat
        self.max_pause = max_pause or (pause * retreat ** times)
        self.cleanup = cleanup

    def __call__(self, f):
        """
        A decorator function to retry a function (i.e. an API call or web
        query) a number of times, with optional exceptions under which to
        retry.

        Returns the result of a cleanup function if all retries fail.
        :return: decorator function.
        """
        @wraps(f)
        def wrapped_f(*args, **kwargs):
            for i in range(self.times):
                # Exponential backoff if required, limited to a max pause time
                pause = min(self.pause * self.retreat ** i, self.max_pause)
                try:
                    return f(*args, **kwargs)
                except self.exceptions:
                    if self.pause is not None:
                        time.sleep(pause)
            if self.cleanup is not None:
                return self.cleanup(*args, **kwargs)
        return wrapped_f
You can create a function to deal with the failed call (after the maximum number of retries):
def failed_call(*args, **kwargs):
    """Deal with a failed call within various web service calls.

    Will print to a log file with details of the failed call.
    """
    print("Failed call: " + str(args) + str(kwargs))
    # Don't have to raise this here if you don't want to.
    # Would be used if you want to do some other try/except error catching.
    raise RequestException
Make a class instance to use as a decorator for your function calls:
# Class instance to use as a retry decorator
retry = Retry(times=5, pause=1, retreat=2, cleanup=failed_call,
              exceptions=(RequestException, timeout))
With retreat=2, the first retry waits 1 second, the second retry 2 seconds, the third 4 seconds, and so on.
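As a quick sanity check, you can compute the pause schedule directly from the same formula wrapped_f uses (a small sketch using the defaults from the instance above, where max_pause falls back to pause * retreat ** times):

# Pause schedule for times=5, pause=1, retreat=2; max_pause defaults to 1 * 2**5 = 32
pause, retreat, times = 1, 2, 5
max_pause = pause * retreat ** times
print([min(pause * retreat ** i, max_pause) for i in range(times)])
# [1, 2, 4, 8, 16]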
Then define your function to scrape a site, decorated with your retry decorator:
@retry
def scrape_a_site(url, params):
    r = requests.get(url, params=params)
    return r
Note that you can easily choose which exceptions trigger a retry; I've used RequestException and timeout here. Adapt them to your situation.
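For example, since your loop drives a Selenium browser rather than requests, you may prefer to retry on Selenium's own exceptions instead; a minimal sketch (assuming retries on TimeoutException and WebDriverException are what you want, and reusing failed_call from above):

from selenium.common.exceptions import TimeoutException, WebDriverException

# Retry Selenium page loads on driver-level timeouts and connection problems
selenium_retry = Retry(times=5, pause=1, retreat=2, cleanup=failed_call,
                       exceptions=(TimeoutException, WebDriverException))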
As for your code, you could adapt it to something like this (with the decorator defined by the first block of code above):
# Class instance to use as a retry decorator
retry = Retry(times=5, pause=1, retreat=2, cleanup=None,
              exceptions=(RequestException, timeout))

@retry
def get_html(browser, url):
    '''Get HTML from url'''
    browser.get(url)
    return browser.page_source

def scrape(urls):
    browser = webdriver.Firefox()
    datatable = []
    for url in urls:
        html = get_html(browser, url)
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})
Note that you apply @retry to the smallest possible block of code (just the web-lookup logic).
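One more point, since cleanup=None is used in this second instance: when all retries fail, get_html simply returns None, so the loop should skip that URL rather than pass None to BeautifulSoup. A minimal sketch of the "move on to the next link" part of your question:

for url in urls:
    html = get_html(browser, url)
    if html is None:
        # all retries failed for this url -- log it if you like, then move on
        continue
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})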