Scrapy: Connection refused

and*_*ers 6 python scrapy web-scraping

I get an error when I try to test my Scrapy installation:

$ scrapy shell http://www.google.es
2011-02-16 10:54:46+0100 [scrapy] INFO: Scrapy 0.12.0.2536 started (bot: scrapybot)
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Enabled extensions: TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, HttpCompressionMiddleware, DownloaderStats
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Enabled item pipelines: 
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2011-02-16 10:54:46+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2011-02-16 10:54:46+0100 [default] INFO: Spider opened
2011-02-16 10:54:47+0100 [default] DEBUG: Retrying <GET http://www.google.es> (failed 1 times): Connection was refused by other side: 111: Connection refused.
2011-02-16 10:54:47+0100 [default] DEBUG: Retrying <GET http://www.google.es> (failed 2 times): Connection was refused by other side: 111: Connection refused.
2011-02-16 10:54:47+0100 [default] DEBUG: Discarding <GET http://www.google.es> (failed 3 times): Connection was refused by other side: 111: Connection refused.
2011-02-16 10:54:47+0100 [default] ERROR: Error downloading <http://www.google.es>: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionRefusedError'>: Connection was refused by other side: 111: Connection refused.
    ]
2011-02-16 10:54:47+0100 [scrapy] ERROR: Shell error
    Traceback (most recent call last):
    Failure: scrapy.exceptions.IgnoreRequest: Connection was refused by other side: 111: Connection refused.

2011-02-16 10:54:47+0100 [default] INFO: Closing spider (shutdown)
2011-02-16 10:54:47+0100 [default] INFO: Spider closed (shutdown)

Versions:

  • Scrapy 0.12.0.2536
  • Python 2.6.6
  • OS: Ubuntu 10.10

Edit: I can reach the site from my browser, with wget, and with telnet google.es 80. The error occurs for every site, not just this one.
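The telnet check can also be repeated from Python itself, which helps tell whether the refusal is specific to Scrapy or affects any Python process on the machine. The following is only a minimal sketch, assuming Python 3; `can_connect` is a hypothetical helper name, not part of Scrapy or Twisted.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a plain TCP connection to (host, port) can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Covers refused connections, timeouts, and DNS lookup failures.
        return False
    sock.close()
    return True
```

Calling `can_connect("www.google.es", 80)` mirrors `telnet google.es 80`. If it returns True while Scrapy still reports "Connection refused", the problem is more likely in how Scrapy routes or builds the request than in basic network access.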

niz*_*.sp 9

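Step 1: Scrapy sends a user agent string that contains 'bot'. Sites may block requests based on the user agent.

Try overriding USER_AGENT in settings.py.

For example: USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:7.0.1) Gecko/20100101 Firefox/7.7'

Step 2: Try adding a delay between requests so that the traffic looks like it comes from a human.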

DOWNLOAD_DELAY = 0.25 

Step 3: If that does not help, install Wireshark and compare the request headers (or POST data) that Scrapy sends with those your browser sends.
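Once both sets of request headers have been copied out of the capture, the comparison described above can be done mechanically. This is a rough sketch, assuming Python 3 and that the headers have been typed into plain dicts by hand; `diff_headers` is a hypothetical helper, and the header values in the usage note are made up for illustration.

```python
def diff_headers(scrapy_headers, browser_headers):
    """Compare two header mappings, ignoring case in header names.

    Returns a dict that maps each lower-cased header name whose value
    differs to a (scrapy_value, browser_value) tuple. A header missing
    on one side is reported as None on that side.
    """
    a = {name.lower(): value for name, value in scrapy_headers.items()}
    b = {name.lower(): value for name, value in browser_headers.items()}
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

For instance, `diff_headers({"User-Agent": "examplebot"}, {"User-Agent": "Mozilla/5.0", "Accept-Language": "en"})` reports the `user-agent` mismatch and the `accept-language` header that only the browser sent. Headers that differ are the first candidates to override in settings.py.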