gra*_*per 7 python scrapy web-scraping
So I have a Scrapy program I am trying to start, but I cannot get my code to execute; it always fails with the error below.
I can still reach the site with the scrapy shell
command, so I know the URL and everything else works.
Here is my code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Malscraper.items import MalItem

class MalSpider(CrawlSpider):
    name = 'Mal'
    allowed_domains = ['www.website.net']
    start_urls = ['http://www.website.net/stuff.php?']
    rules = [
        Rule(LinkExtractor(
                allow=['//*[@id="content"]/div[2]/div[2]/div/span/a[1]']),
             callback='parse_item',
             follow=True)
    ]

    def parse_item(self, response):
        mal_list = response.xpath('//*[@id="content"]/div[2]/table/tr/td[2]/')
        for mal in mal_list:
            item = MalItem()
            item['name'] = mal.xpath('a[1]/strong/text()').extract_first()
            item['link'] = mal.xpath('a[1]/@href').extract_first()
            yield item
EDIT: Here is the traceback.
Traceback (most recent call last):
File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
response = self._open(req, data)
File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
'_open', req)
File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
result = func(*args)
File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
EDIT2:
So with the scrapy shell command
I am able to manipulate my response, but I just noticed the same error appears again when I visit the site.
EDIT3:
I now find that the error appears on every website I use the shell command
on, but I can still manipulate the response.
EDIT4: So how do I verify that I am at least receiving a response from Scrapy when running the crawl command?
Right now I can't tell whether my code or this error is the reason my logs come up empty.
Here is my settings.py:
BOT_NAME = 'Malscraper'
SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'
FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'
Jos*_*rdo 18
There is an open Scrapy issue for this problem: https://github.com/scrapy/scrapy/issues/1054
although it appears to be only a warning on other platforms.
You can disable the S3DownloadHandler (which causes this error) by adding this to your Scrapy settings:
DOWNLOAD_HANDLERS = {
's3': None,
}
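For reference, a sketch of how that override fits into the settings.py from the question (the other values are the ones already shown above):

```python
# Malscraper/settings.py (sketch): the question's settings plus the
# 's3': None entry, which disables the boto-backed S3 handler.
BOT_NAME = 'Malscraper'

SPIDER_MODULES = ['Malscraper.spiders']
NEWSPIDER_MODULE = 'Malscraper.spiders'

FEED_URI = 'logs/%(name)s/%(time)s.csv'
FEED_FORMAT = 'csv'

DOWNLOAD_HANDLERS = {
    's3': None,  # never load the S3 handler, so boto is not touched
}
```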
You can also remove boto
from the optional features by adding:
from scrapy import optional_features
optional_features.remove('boto')
as mentioned in that issue.