Connection errors and timeouts when web scraping with Python on Debian

Mr.*_*r.D 2 python browser debian try-catch web-scraping

I have a web scraping script that works through thousands of links. But sometimes I get connection errors, timeout errors or gateway errors, and my script just stops.

Here is part of my code (urls holds the links I loop over):

from selenium import webdriver
from bs4 import BeautifulSoup

def scrape(urls):
    browser = webdriver.Firefox()
    datatable = []
    for url in urls:
        browser.get(url)
        html = browser.page_source
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})

I think I have to use a try/except approach to handle this, and when it happens, try reading the site again.

My question is: where in my code do I have to put this, and what do I write, to catch these errors and retry or move on to the next link?

try:
    r = requests.get(url, params={'s': thing})
except requests.exceptions.RequestException:
    # What do I have to write here, and where do I place this part correctly?

Thanks!

And*_*Guy 5

When I have dealt with these kinds of errors before, I have written a decorator that retries the function call a set number of times if a given exception is raised.

from functools import wraps
import time
from requests.exceptions import RequestException
from socket import timeout

class Retry(object):
    """Decorator that retries a function call a number of times, optionally
    with particular exceptions triggering a retry, whereas unlisted exceptions
    are raised.
    :param pause: Number of seconds to pause before retrying
    :param retreat: Factor by which to extend pause time each retry
    :param max_pause: Maximum time to pause before retry. Overrides pause times
                      calculated by retreat.
    :param cleanup: Function to run if all retries fail. Takes the same
                    arguments as the decorated function.
    """
    def __init__(self, times, exceptions=(IndexError,), pause=1, retreat=1,
                 max_pause=None, cleanup=None):
        """Initialise all input params"""
        self.times = times
        self.exceptions = exceptions
        self.pause = pause
        self.retreat = retreat
        self.max_pause = max_pause or (pause * retreat ** times)
        self.cleanup = cleanup

    def __call__(self, f):
        """
        A decorator function to retry a function (ie API call, web query) a
        number of times, with optional exceptions under which to retry.

        Returns results of a cleanup function if all retries fail.
        :return: decorator function.
        """
        @wraps(f)
        def wrapped_f(*args, **kwargs):
            for i in range(self.times):
                # Exponential backoff if required and limit to a max pause time
                pause = min(self.pause * self.retreat ** i, self.max_pause)
                try:
                    return f(*args, **kwargs)
                except self.exceptions:
                    if self.pause is not None:
                        time.sleep(pause)
                    else:
                        pass
            if self.cleanup is not None:
                return self.cleanup(*args, **kwargs)
        return wrapped_f

You can create a function to handle a failed call (after the maximum number of retries):

def failed_call(*args, **kwargs):
    """Deal with a failed call within various web service calls.
    Will print to a log file with details of failed call.
    """
    print("Failed call: " + str(args) + str(kwargs))
    # Don't have to raise this here if you don't want to.
    # Would be used if you want to do some other try/except error catching.
    raise RequestException

Make a class instance to decorate your function call with:

# Class instance to use as a retry decorator
retry = Retry(times=5, pause=1, retreat=2, cleanup=failed_call,
              exceptions=(RequestException, timeout))

With retreat=2, the first retry happens after 1 second, the second after 2 seconds, the third after 4 seconds, and so on.
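
As a quick illustration of that timing (a minimal sketch, assuming the Retry class above is in scope; flaky_call and its attempt counter are invented purely for the demo):

attempts = {"count": 0}

@Retry(times=5, pause=1, retreat=2, exceptions=(ValueError,))
def flaky_call():
    # Fails twice with a simulated transient error, then succeeds,
    # so you can watch the pauses grow: 1 second, then 2 seconds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ValueError("simulated transient failure")
    return "succeeded on attempt {}".format(attempts["count"])

print(flaky_call())  # sleeps 1s, then 2s, then prints the success message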

Then define your function to scrape a site, decorated with your retry decorator:

@retry
def scrape_a_site(url, params):
    r = requests.get(url, params=params)
    return r

Note that you can easily choose which exceptions trigger a retry. I have used RequestException and timeout here; adapt these to your situation.
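
For instance, since your scrape function drives Firefox through Selenium rather than requests, you could retry on Selenium's exception classes instead. A sketch (WebDriverException and TimeoutException come from selenium.common.exceptions):

from selenium.common.exceptions import WebDriverException, TimeoutException

# Retry instance tuned for Selenium: WebDriverException covers most
# driver/connection failures and TimeoutException covers page-load timeouts.
selenium_retry = Retry(times=5, pause=1, retreat=2, cleanup=None,
                       exceptions=(WebDriverException, TimeoutException))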

As for your code, you could adapt it to something like this (with your decorator defined using the first block of code above):

from selenium import webdriver
from bs4 import BeautifulSoup

# Class instance to use as a retry decorator
retry = Retry(times=5, pause=1, retreat=2, cleanup=None,
              exceptions=(RequestException, timeout))

@retry
def get_html(browser, url):
    '''Get HTML from url'''
    browser.get(url)
    return browser.page_source

def scrape(urls):
    browser = webdriver.Firefox()
    datatable = []
    for url in urls:
        html = get_html(browser, url)
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})

Note that you apply @retry to the smallest possible block of code (just the web lookup logic).
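
If you would rather skip a link once all retries have failed (instead of stopping the script), one option is a cleanup function that simply returns None so the loop can move on. A sketch along those lines; skip_url is a hypothetical cleanup that takes the same arguments as get_html, as the Retry class requires:

def skip_url(browser, url):
    # Called only after all retries have failed; log and give up on this link.
    print("Giving up on: " + url)
    return None

retry = Retry(times=5, pause=1, retreat=2, cleanup=skip_url,
              exceptions=(RequestException, timeout))

@retry
def get_html(browser, url):
    '''Get HTML from url'''
    browser.get(url)
    return browser.page_source

def scrape(urls):
    browser = webdriver.Firefox()
    datatable = []
    for url in urls:
        html = get_html(browser, url)
        if html is None:
            # All retries failed for this url; move on to the next link.
            continue
        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', {"class": "table table-condensed table-hover data-table m-n-t-15"})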