YSY | python, exception, urllib2, errno, web-crawler
I wrote a crawler that uses urllib2 to fetch URLs.
I'm getting some strange behavior with the requests; I tried to analyze it with Wireshark but couldn't make sense of the problem.
getPAGE() is responsible for fetching a URL. If it fetches the URL successfully it returns its content (response.read()), otherwise it returns None.
from urllib2 import Request, urlopen, HTTPError, URLError
import time

def getPAGE(FetchAddress):
    attempts = 0
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0'}
    while attempts < 2:
        req = Request(FetchAddress, None, headers)
        try:
            response = urlopen(req)  # fetching the url
        except HTTPError, e:
            print 'The server didn\'t do the request.'
            print 'Error code: ', str(e.code) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except URLError, e:
            print 'Failed to reach the server.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        except Exception, e:
            print 'Something bad happened in getPAGE.'
            print 'Reason: ', str(e.reason) + " address: " + FetchAddress
            time.sleep(4)
            attempts += 1
        else:
            return response.read()  # runs only when urlopen() raised no exception
    return None
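For reference, this is roughly how the same fetch would look with an explicit timeout passed to urlopen() (a sketch only; the 30-second value, the socket.error handling and the fetch_with_timeout name are assumptions for illustration, not something my script currently does):

import socket
from urllib2 import Request, urlopen, URLError

def fetch_with_timeout(FetchAddress, timeout=30):
    # hypothetical variant of getPAGE(): the timeout makes a hung connection
    # fail with an exception instead of blocking indefinitely
    headers = {'User-Agent': 'Mozilla/5.0'}
    req = Request(FetchAddress, None, headers)
    try:
        response = urlopen(req, timeout=timeout)  # the timeout argument exists since Python 2.6
        return response.read()
    except (URLError, socket.error), e:
        print 'fetch failed for ' + FetchAddress + ': ' + str(e)
        return None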
This is the function that calls getPAGE() and checks whether the page I fetched is valid (the check is companyID = soup.find('span', id='lblCompanyNumber').string; if companyID is None the page is invalid). If the page is valid, it saves the soup object to a global variable named 'curRes'.
def isValid(ID):
    global curRes  # curRes, urlPath and saveToCsv() are defined elsewhere in the script
    try:
        address = urlPath + str(ID)
        page = getPAGE(address)
        if page == None:
            saveToCsv(ID, badRequest = True)
            return False
    except Exception, e:
        print "An error occured in the first Exception block of parseHTML : " + str(e) + ' address: ' + address
    else:
        try:
            soup = BeautifulSoup(page)
        except TypeError, e:
            print "An error occured in the second Exception block of parseHTML : " + str(e) + ' address: ' + address
            return False
        try:
            companyID = soup.find('span', id='lblCompanyNumber').string
            if (companyID == None):  # if lblCompanyNumber is None we can assume that we don't have the content we want, save in the bad log file
                saveToCsv(ID, isEmpty = True)
                return False
            else:
                curRes = soup  # we have the data we need, save the soup obj to a global variable
                return True
        except Exception, e:
            print "Error while parsing this page, third exception block: " + str(e) + ' id: ' + address
            return False
The strange behavior is this -
Error while parsing this page, third exception block: 'NoneType' object has no attribute 'string' id: ....
That's odd, because I only create the soup object after getting a valid result back from getPAGE(), yet it looks as if the soup call returned None, and the exception is raised whenever I run
companyID = soup.find('span', id='lblCompanyNumber').string
The soup object should never be None; it should hold the HTML returned by getPAGE() if execution ever reaches that part of the code.
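To make the failure concrete, the lookup that blows up could be guarded like this (a sketch only; find_company_id is a hypothetical helper, not something in my script):

def find_company_id(soup):
    # hypothetical helper: same lookup as in isValid(), but guarded so a missing
    # <span id="lblCompanyNumber"> gives None instead of raising AttributeError
    tag = soup.find('span', id='lblCompanyNumber')
    if tag is None:
        # find() returned nothing, so tag.string would raise the
        # "'NoneType' object has no attribute 'string'" error shown above
        return None
    return tag.string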
I've checked and found that the problem is somehow connected to the first issue (a GET being sent without waiting for the reply). On Wireshark I can see that every time I get that exception, urllib2 sent a GET request for the URL but did not wait for the response and simply moved on. In that case getPAGE() should have returned None for that URL, but if it had returned None, isValid(ID) would not have passed the "if page == None:" check. I can't figure out why it happens, and I haven't been able to reproduce the issue on demand.
I've read that time.sleep() can cause problems with urllib2 threading, so maybe I should avoid using it?
Why doesn't urllib2 always wait for the response (it only happens rarely that it doesn't wait)?
What can I do about the "[Errno 10054] An existing connection was forcibly closed by the remote host" error? BTW - the exception is not caught by the getPAGE() try/except block; it is caught by the first isValid() try/except block, which is also strange, because getPAGE() is supposed to catch every exception it throws.
try:
    address = urlPath + str(ID)
    page = getPAGE(address)
    if page == None:
        saveToCsv(ID, badRequest = True)
        return False
except Exception, e:
    print "An error occured in the first Exception block of parseHTML : " + str(e) + ' address: ' + address
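Regarding the [Errno 10054] question, this is the kind of explicit handling I have in mind (a sketch only; read_body is a hypothetical helper, and treating the reset as a failed fetch is just an assumption):

import socket

def read_body(response):
    # hypothetical helper: 'response' is whatever urlopen() returned
    try:
        return response.read()
    except socket.error, e:
        if e.args[0] == 10054:  # WSAECONNRESET: connection forcibly closed by the remote host
            return None         # treat the reset like a failed fetch and retry later
        raise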
Thanks!