Apr*_*ril 12 twisted scrapy web-scraping scrapy-spider
我正在抓一些页面scrapy并得到以下错误:
twisted.internet.error.ConnectionLost
我的命令行输出:
2015-05-04 18:40:32+0800 [cnproxy] INFO: Spider opened
2015-05-04 18:40:32+0800 [cnproxy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-05-04 18:40:32+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:32+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy3.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy3.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy3.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy8.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy9.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy8.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy8.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy10.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy9.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy9.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy10.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy10.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu1.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu1.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy5.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy7.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy7.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy7.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy5.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy5.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy6.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy6.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:33+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy6.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxyedu2.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:34+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxyedu2.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Retrying <GET http://www.cnproxy.com/proxy4.html> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] DEBUG: Gave up retrying <GET http://www.cnproxy.com/proxy4.html> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] ERROR: Error downloading <GET http://www.cnproxy.com/proxy4.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]
2015-05-04 18:40:35+0800 [cnproxy] INFO: Closing spider (finished)
2015-05-04 18:40:35+0800 [cnproxy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 36,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 36,
'downloader/request_bytes': 8121,
'downloader/request_count': 36,
'downloader/request_method_count/GET': 36,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 5, 4, 10, 40, 35, 608377),
'log_count/DEBUG': 38,
'log_count/ERROR': 12,
'log_count/INFO': 7,
'scheduler/dequeued': 36,
'scheduler/dequeued/memory': 36,
'scheduler/enqueued': 36,
'scheduler/enqueued/memory': 36,
'start_time': datetime.datetime(2015, 5, 4, 10, 40, 32, 624695)}
2015-05-04 18:40:35+0800 [cnproxy] INFO: Spider closed (finished)
Run Code Online (Sandbox Code Playgroud)
我的settings.py:
SPIDER_MODULES = ['proxy.spiders']
NEWSPIDER_MODULES = 'proxy.spiders'
DOWNLOAD_DELAY = 0
DOWNLOAD_TIMEOUT = 30
ITEM_PIPELINES = {
'proxy.pipelines.ProxyPipeline':100,
}
CONCURRENT_ITEMS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 64
#CONCURRENT_SPIDERS = 128
LOG_ENABLED = True
LOG_ENCODING = 'utf-8'
LOG_FILE = '/home/hadoop/modules/scrapy/myapp/proxy/proxy.log'
LOG_LEVEL = 'DEBUG'
LOG_STDOUT = False
Run Code Online (Sandbox Code Playgroud)
我的蜘蛛proxy_spider.py:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from proxy.items import ProxyItem
import re
class ProxycrawlerSpider(CrawlSpider):
name = 'cnproxy'
allowed_domains = ['www.cnproxy.com']
indexes = [1,2,3,4,5,6,7,8,9,10]
start_urls = []
for i in indexes:
url = 'http://www.cnproxy.com/proxy%s.html' % i
start_urls.append(url)
start_urls.append('http://www.cnproxy.com/proxyedu1.html')
start_urls.append('http://www.cnproxy.com/proxyedu2.html')
def parse_ip(self,response):
sel = HtmlXPathSelector(response)
addresses = sel.select('//tr[position()>1]/td[position()=1]').re('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')
protocols = sel.select('//tr[position()>1]/td[position()=2]').re('<td>(.*)<\/td>')
locations = sel.select('//tr[position()>1]/td[position()=4]').re('<td>(.*)<\/td>')
ports_re = re.compile('write\(":"(.*)\)')
raw_ports = ports_re.findall(response.body);
port_map = {'z':'3','m':'4','k':'2','l':'9','d':'0','b':'5','i':'7','w':'6','r':'8','c':'1','+':''}
ports = []
for port in raw_ports:
tmp = port
for key in port_map:
tmp = tmp.replace(key,port_map[key]);
ports.append(tmp)
items = []
for i in range(len(addresses)):
item = ProxyItem()
item['address'] = addresses[i]
item['protocol'] = protocols[i]
item['location'] = locations[i]
item['port'] = ports[i]
items.append(item)
return items
Run Code Online (Sandbox Code Playgroud)
我的管道或设置有什么问题吗?如果不是,我该如何防止twisted.internet.error.ConnectionLost错误.
我试过了 scrapy shell
$scrapy shell http://www.cnproxy.com/proxy1.html
Run Code Online (Sandbox Code Playgroud)
并获得与标题相同的错误.但我可以使用我的Chrome访问该页面.我尝试过其他类似的网页
$scrapy shell http://stackoverflow.com
Run Code Online (Sandbox Code Playgroud)
他们都运作良好.
sur*_*190 10
您需要设置用户代理字符串.似乎有些网站不喜欢它,并在您的用户代理不是浏览器时阻止.您可以找到用户代理字符串的示例.
该文章指出了最佳做法,停止你的蜘蛛被阻塞.
打开settings.py:添加以下用户代理
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36'
您还可以尝试用户代理随机数
| 归档时间: |
|
| 查看次数: |
5520 次 |
| 最近记录: |