27 python mechanize beautifulsoup web-scraping python-2.7
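I am trying to fetch some data from a website, but it gives me an incomplete read. The data I want is a large set of nested links. I did some research online and found that this may be caused by a server error (a chunked transfer encoding that finishes before reaching the expected size). I also found a workaround for this at this link.
However, I am not sure how to use it in my case. Here is the code I am working with: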
import urllib2
import urlparse

import mechanize
# Assumed import (BeautifulSoup 3); with bs4 use: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1;Trident/5.0)')]
urls = "http://shop.o2.co.uk/mobile_phones/Pay_Monthly/smartphone/all_brands"
page = urllib2.urlopen(urls).read()
soup = BeautifulSoup(page)
links = soup.findAll('img', url=True)
for tag in links:
    name = tag['alt']
    tag['url'] = urlparse.urljoin(urls, tag['url'])
    r = br.open(tag['url'])
    page_child = br.response().read()
    soup_child = BeautifulSoup(page_child)
    contracts = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "tariff-duration"})]
    data_usage = [tag_c['value'] for tag_c in soup_child.findAll('input', {"name": "allowance"})]
    print contracts
    print data_usage
Please help. Thanks.
Kyl*_*yle 23
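The link you included in your question is simply a wrapper around urllib's read() function that catches any incomplete read exceptions for you. If you don't want to implement that whole patch, you can always wrap the read of your links in a try/except block. For example: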
import httplib
import urllib2

try:
    page = urllib2.urlopen(urls).read()
except httplib.IncompleteRead as e:
    # Keep whatever arrived before the connection was cut short.
    page = e.partial
For Python 3:
import http.client
from urllib import request

try:
    page = request.urlopen(urls).read()
except http.client.IncompleteRead as e:
    page = e.partial
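If you need the same fallback in several places, the try/except above can be folded into a small helper. This is only a minimal Python 3 sketch: `read_tolerant` and `_TruncatedResponse` are hypothetical names made up for illustration, and the stub class stands in for a real server that closes the connection early.

```python
import http.client


def read_tolerant(response):
    """Return (body, complete); body is whatever bytes arrived before the cut-off."""
    try:
        return response.read(), True
    except http.client.IncompleteRead as e:
        # e.partial holds the bytes received before the connection ended early.
        return e.partial, False


class _TruncatedResponse:
    """Hypothetical stand-in for a response whose server closes early."""

    def read(self):
        raise http.client.IncompleteRead(b"<html>partial", 100)


body, complete = read_tolerant(_TruncatedResponse())
```

Returning the `complete` flag alongside the body lets the caller decide whether truncated HTML is good enough to parse or whether the page should be fetched again.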
I found that in my case sending an HTTP/1.0 request fixed the problem. Adding this did it:
import httplib
httplib.HTTPConnection._http_vsn = 10
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.0'
After that I make the request:
req = urllib2.Request(url, post, headers)
filedescriptor = urllib2.urlopen(req)
img = filedescriptor.read()
Afterwards I switch back to HTTP/1.1 (for connections that support 1.1):
httplib.HTTPConnection._http_vsn = 11
httplib.HTTPConnection._http_vsn_str = 'HTTP/1.1'
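The trick is to use HTTP/1.0 instead of the default HTTP/1.1. HTTP/1.1 supports chunked transfer, but for some reason this web server does not handle it correctly, so we make the request over HTTP/1.0, which has no chunked encoding.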
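The switch to HTTP/1.0 and back can be wrapped in a context manager so the restore step is never forgotten. This is a rough Python 3 sketch, not a supported API: `force_http10` is a hypothetical name, and `_http_vsn` / `_http_vsn_str` are private attributes of `http.client.HTTPConnection` (the Python 3 counterpart of `httplib`), so this may break in other versions. The change is process-wide while the block is active, so it is not safe when other threads are making requests.

```python
import contextlib
import http.client


@contextlib.contextmanager
def force_http10():
    """Temporarily make http.client send HTTP/1.0 requests, then restore."""
    conn = http.client.HTTPConnection
    saved = (conn._http_vsn, conn._http_vsn_str)
    conn._http_vsn, conn._http_vsn_str = 10, "HTTP/1.0"
    try:
        yield
    finally:
        # Restore even if the request raised, so later requests are unaffected.
        conn._http_vsn, conn._http_vsn_str = saved


with force_http10():
    inside = http.client.HTTPConnection._http_vsn_str
    # The actual request would go here, e.g. urllib.request.urlopen(url).read()
after = http.client.HTTPConnection._http_vsn_str
```

Here `inside` records the version string while the block is active and `after` records it once the block has exited, which is an easy way to check that the original setting came back.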
小智 8
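What worked for me was catching IncompleteRead as an exception and collecting the data you managed to read in each iteration, by putting the read in a loop like the one below. (Note: I am using Python 3.4.1, and the urllib library changed between 2.7 and 3.4.)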
import http.client
import json
import urllib.request

# Hypothetical wrapper name; the original snippet ended in a bare `return`.
def make_rest_call(url, data):
    try:
        requestObj = urllib.request.urlopen(url, data)
        responseJSON = ""
        while True:
            try:
                responseJSONpart = requestObj.read()
            except http.client.IncompleteRead as icread:
                # Keep the partial data and try reading again.
                responseJSON = responseJSON + icread.partial.decode('utf-8')
                continue
            else:
                responseJSON = responseJSON + responseJSONpart.decode('utf-8')
                break
        return json.loads(responseJSON)
    except Exception as RESTex:
        print("Exception occurred making REST call: " + str(RESTex))