python beautifulsoup web-scraping
I am trying to extract data from the Civic Commons Apps link for my project. I am able to get the links to the pages I need, but when I try to open those links I get "urlopen error [Errno -2] Name or service not known".
The web-scraping Python code:
from bs4 import BeautifulSoup
from urlparse import urlparse, parse_qs
import re
import urllib2
import pdb

base_url = "http://civiccommons.org"
url = "http://civiccommons.org/apps"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

# Collect the category links from the apps page
list_of_links = []
for link_tag in soup.findAll('a', href=re.compile('^/civic-function.*')):
    string_temp_link = base_url + link_tag.get('href')
    list_of_links.append(string_temp_link)

list_of_links = list(set(list_of_links))

# For each category, enumerate all of its pages
list_of_next_pages = []
for categorized_apps_url in list_of_links:
    categorized_apps_page = urllib2.urlopen(categorized_apps_url)
    categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())

    last_page_tag = categorized_apps_soup.find('a', title="Go to last page")
    if last_page_tag:
        last_page_url = base_url + last_page_tag.get('href')
        index_value = last_page_url.find("page=") + 5
        base_url_for_next_page = last_page_url[:index_value]
        for pageno in xrange(0, int(parse_qs(urlparse(last_page_url).query)['page'][0]) + 1):
            list_of_next_pages.append(base_url_for_next_page + str(pageno))
    else:
        list_of_next_pages.append(categorized_apps_url)
I get the following error:
urllib2.urlopen(categorized_apps_url)
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>
Is there anything specific I should be handling when I call urlopen? I don't see anything wrong with the http links I am getting.
[edit] On the second run I got the following error:
File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 400, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 418, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1207, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
raise URLError(err)
The same code runs fine on my friend's Mac, but fails on my Ubuntu 12.04.
Also, I tried running the code on ScraperWiki and it completed successfully, but a few URLs were missing (compared to the Mac). Is there any reason for this behavior?
The code works on my Mac and on your friend's Mac. It also runs fine from a virtual machine instance of an Ubuntu 12.04 server. Clearly, something in your particular environment - your OS (Ubuntu desktop?) or your network - is making it fail. For example, my home router's default settings throttle the number of calls to the same domain within x seconds; if I don't turn that off, it can cause exactly this kind of problem. There could be a number of causes.
At this stage, I would suggest refactoring your code to catch URLError and set aside the problematic URLs for a retry. Also log/print the error if it still fails after several retries. Maybe even add some code to time the calls between errors. That is better than having your script simply fail outright, and you get feedback on whether it is just particular URLs causing the problem or a timing issue (i.e., does it fail after x number of calls to urlopen, or does it fail after x number of calls to urlopen in x amount of micro/seconds). If it is a timing issue, a simple time.sleep(1) inserted into your loop might do the trick.
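For instance, here is a minimal sketch of that retry approach, reusing the list_of_links and BeautifulSoup setup from your question; the open_with_retry helper and the retry/delay values are hypothetical choices of mine, not part of your original code:

import time
import urllib2

def open_with_retry(url, retries=3, delay=1):
    # Try the URL a few times, sleeping between attempts, before giving up.
    for attempt in xrange(retries):
        try:
            return urllib2.urlopen(url)
        except urllib2.URLError as e:
            print "attempt %d failed for %s: %s" % (attempt + 1, url, e)
            time.sleep(delay)
    return None  # caller decides what to do with URLs that never succeeded

failed_urls = []
for categorized_apps_url in list_of_links:
    categorized_apps_page = open_with_retry(categorized_apps_url)
    if categorized_apps_page is None:
        failed_urls.append(categorized_apps_url)  # keep for a later retry pass or for logging
        continue
    categorized_apps_soup = BeautifulSoup(categorized_apps_page.read())
    # ... rest of your loop body from the question ...

If the failures turn out to be timing-related, increasing delay (or adding an unconditional time.sleep(1) inside the loop) should make that obvious.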