Python urllib2, "[errno 10054] An existing connection was forcibly closed by the remote host", and some other urllib2 problems


I wrote a crawler that uses urllib2 to fetch URLs.

With each request I get some strange behavior. I tried to analyze it with Wireshark but couldn't understand the problem.

getPAGE() is responsible for fetching a url. It returns the content of the url (response.read()) if it fetches it successfully, otherwise it returns None.

import time
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req) # fetching the url
        except HTTPError, e:
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', str(e.code) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + "  address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e) + "  address: " + FetchAddress # a generic Exception has no .reason attribute
            time.sleep(4)
            attempts += 1
        else:
            return response.read() # note: read() runs in the else clause, outside the try
    return None

This is the function that calls getPAGE() and checks whether the fetched page is valid (checking companyID = soup.find('span', id='lblCompanyNumber').string; if companyID is None, the page is invalid). If the page is valid, it saves the soup object to a global variable named 'curRes'.

from BeautifulSoup import BeautifulSoup # assumes BeautifulSoup 3; with bs4 it would be `from bs4 import BeautifulSoup`

def isValid(ID):
    global curRes
    try:
        address = urlPath + str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest = True)
            return False
    except Exception, e:
        print "An error occurred in the first Exception block of parseHTML : " + str(e) + ' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occurred in the second Exception block of parseHTML : " + str(e) + ' address: ' + address
            return False
        try:
            companyID = soup.find('span', id='lblCompanyNumber').string
            if (companyID == None): # if lblCompanyNumber is None we can assume we don't have the content we want, save to the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                curRes = soup # we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False

The strange behavior is:

  1. Sometimes urllib2 fires a GET request and, without waiting for the reply, sends the next GET request (ignoring the previous one).
  2. Sometimes I get "[errno 10054] An existing connection was forcibly closed by the remote host" and the code gets stuck for about 20 minutes waiting for a response from the server. While it is stuck I copy the url, try to fetch it manually, and get a response in less than a second (?). (See the timeout sketch right after this list.)
  3. getPAGE() returns None to isValid() if it fails to fetch the url, yet sometimes I get the error:

Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id: ....
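
For the 20-minute hang in item 2, one mitigation (a sketch, not a confirmed fix for the 10054 error itself) is to bound how long urllib2 waits: urlopen() accepts a timeout argument since Python 2.6, and socket.setdefaulttimeout() covers older versions. The url and headers below are placeholders, not the real ones from my script:

import socket
from urllib2 import Request, urlopen, URLError

socket.setdefaulttimeout(15) # global fallback, works before Python 2.6

req = Request('http://example.com/', None, {'User-Agent': 'Mozilla/5.0'}) # placeholder url/headers
try:
    response = urlopen(req, timeout=15) # per-call timeout, Python 2.6+
    html = response.read()
except (URLError, socket.timeout), e:
    # a stalled connection now fails after ~15 seconds instead of ~20 minutes
    print 'Request failed or timed out:', e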

Item 3 is strange, because I only create the soup object when I get a valid result from getPAGE(), and yet find() seems to return None (which is what the AttributeError means), raising an exception whenever this line runs:

companyID = soup.find('span', id='lblCompanyNumber').string

The soup object should never be None: it must have received HTML from getPAGE() to reach that part of the code.

I've checked and found that the problem is somehow connected to the first issue (sending a GET without waiting for the reply). Every time I get that exception, I can see (in Wireshark) that urllib2 sent a GET request for the url but didn't wait for the response and moved on. getPAGE() should have returned None for that url, but then isValid(ID) would not get past the "if page == None:" condition. I can't figure out why it happens, and the problem is impossible to reproduce.
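
Whatever the root cause, the chained lookup soup.find(...).string raises AttributeError whenever the span is missing, because find() returns None and None has no .string attribute. A sketch of a drop-in replacement for the third try block in isValid() (same names as above) that makes the failure explicit:

tag = soup.find('span', id='lblCompanyNumber')
if tag is None or tag.string is None:
    # either the span is missing or it has no single string child;
    # treat both as "page doesn't have the content we want"
    saveToCsv(ID, isEmpty = True)
    return False
curRes = soup # we have the data we need
return True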

I've read that time.sleep() can cause problems with urllib2 in threads, so maybe I should avoid using it?

Why doesn't urllib2 always wait for the response (it happens only rarely that it doesn't wait)?

What can I do about the "[errno 10054] An existing connection was forcibly closed by the remote host" error? BTW, the exception is not caught by the getPAGE() try/except blocks; it is caught by the first try/except block in isValid(), which is also strange, since getPAGE() is supposed to catch every exception it throws.

try:
    address = urlPath + str(ID)
    page = getPAGE(address)
    if page == None:
        saveToCsv(ID, badRequest = True)
        return False
except Exception, e:
    print "An error occurred in the first Exception block of parseHTML : " + str(e) + ' address: ' + address
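One detail in the posted code that may explain this (an observation, not a confirmed diagnosis): response.read() runs in the while loop's else clause, which is outside the try, so a socket error raised while the body is being read propagates out of getPAGE() and lands in isValid()'s outer except block. A minimal sketch of getPAGE() with the read moved inside the try:

import time
import socket
from urllib2 import Request, urlopen, HTTPError, URLError

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0'} # shortened placeholder
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req) # may raise HTTPError / URLError
            return response.read() # may raise socket.error mid-transfer (e.g. errno 10054)
        except (HTTPError, URLError, socket.error), e:
            print 'Fetch failed: %s  address: %s' % (e, FetchAddress)
            time.sleep(4)
            attempts += 1
    return None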

Thanks!