Pet*_*rov 3 python https networking urllib2 web-scraping
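I am writing a web scraping script and have split it into four parts. Each part works perfectly on its own, but when I put them together I get the following error: urlopen error [Errno 111] Connection refused. I have looked at similar questions and tried catching the error with try-except, but even that does not work. My complete code is: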
from selenium import webdriver
import re
import urllib2
site = ""

def phone():
    global site
    site = "https://www." + site
    if "spokeo" in site:
        browser = webdriver.Firefox()
        browser.get(site)
        content = browser.page_source
        browser.quit()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
        if m_obj:
            print m_obj.group(0)
    elif "addresses" in site:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)
    else:
        usock = urllib2.urlopen(site)
        data = usock.read()
        usock.close()
        m_obj = re.search(r"(\d{3}-\s\d{3}-\d{4})", data)
        if m_obj:
            print m_obj.group(0)

def pipl():
    global site
    url = "https://pipl.com/search/?q=tom+jones&l=Phoenix%2C+AZ%2C+US&sloc=US|AZ|Phoenix&in=6"
    usock = urllib2.urlopen(url)
    data = usock.read()
    usock.close()
    r_list = [#re.compile("spokeo.com/[^\s]+"),
              re.compile("addresses.com/[^\s]+"),
              re.compile("10digits.us/[^\s]+")]
    for r in r_list:
        match = re.findall(r,data)
        for site in match:
            site = site[:-6]
            print site
            phone()

pipl()
Here is my traceback:
Traceback (most recent call last):
  File "/home/lazarov/.spyder2/.temp.py", line 48, in <module>
    pipl()
  File "/home/lazarov/.spyder2/.temp.py", line 46, in pipl
    phone()
  File "/home/lazarov/.spyder2/.temp.py", line 25, in phone
    usock = urllib2.urlopen(site)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 400, in open
    response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 418, in _open
    '_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1177, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
After debugging the code manually, I found that the error comes from the function phone(), so I tried running just that part:
import re
import urllib2
url = 'http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7'
usock = urllib2.urlopen(url)
data = usock.read()
usock.close()
m_obj = re.search(r"(\d{3}-\d{3}-\d{4})", data)
if m_obj:
    print m_obj.group(0)
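It worked. I believe this shows that the cause is neither a firewall actively refusing the connection nor the service on the remote site being down or overloaded. Any help would be appreciated.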
The devil is usually in the details.
According to your traceback...
  File "/usr/lib/python2.7/urllib2.py", line 1215, in https_open
    return self.do_open(httplib.HTTPSConnection, req)
and your source code...
site = "https://www." + site
...I would guess that in your code you are trying to access https://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7, while in your test you are connecting to http://www.10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7.
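To make the difference between the two URLs concrete, here is a small sketch that shows which host and port a client would connect to for each one. The helper name `target` and the `DEFAULT_PORTS` table are my own, purely for illustration; the import fallback is only there so the snippet runs on both Python 2 and Python 3.

```python
try:
    from urlparse import urlsplit        # Python 2
except ImportError:
    from urllib.parse import urlsplit    # Python 3

# Standard default ports for the two schemes.
DEFAULT_PORTS = {"http": 80, "https": 443}

def target(url):
    """Return the (host, port) pair a client would connect to for this URL.

    Hypothetical helper for illustration, not part of urllib2.
    """
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS[parts.scheme]
    return parts.hostname, port

# The path matched by the question's regex, without any scheme.
matched = "10digits.us/n/Tom_Jones/Phoenix_AZ/1fe293a0b7"

print(target("https://www." + matched))  # what phone() ends up requesting
print(target("http://www." + matched))   # what the standalone test requested
```

The first call resolves to port 443 and the second to port 80, so the two runs are not talking to the same service even though the host and path are identical.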
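Try replacing https with http (at least for www.10digits.us): the site you are trying to scrape probably does not respond on port 443 but only on port 80 (you can even check this with your browser).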
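If you would rather not hard-code the scheme per site, one option is to probe port 443 before building the URL and fall back to http when the connection is refused. This is only a sketch under my own assumptions: `port_open`, `pick_scheme` and `build_url` are hypothetical helper names, and `site` is assumed to be the scheme-less string your regex produces (for example "10digits.us/n/...").

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except socket.error:
        # Covers "connection refused", timeouts and DNS failures.
        return False
    conn.close()
    return True

def pick_scheme(host, probe=port_open):
    """Prefer https when port 443 accepts connections, otherwise use http."""
    return "https" if probe(host, 443) else "http"

def build_url(site, probe=port_open):
    """Build a full URL from a scheme-less 'domain/path' string.

    The 'www.' prefix mirrors what phone() does in the question.
    """
    host = "www." + site.split("/", 1)[0]
    return pick_scheme(host, probe) + "://www." + site
```

Inside phone() you could then write `site = build_url(site)` instead of `site = "https://www." + site`. The `probe` parameter exists only so the scheme logic can be exercised without touching the network; note that a successful TCP connection on port 443 does not guarantee that the TLS handshake or the request itself will succeed.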