I am using the Python.org build of Python 2.7 64-bit on Windows Vista 64-bit. I have been testing the following Scrapy code to recursively crawl all pages of the site www.whoscored.com, which is for football statistics:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]

    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
             ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for script in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy', 'crawl', 'goal3'])
The code runs without any errors, but of the 4,623 pages crawled, 217 returned an HTTP 200 response code, 2 returned a 302, and 4,404 returned a 403. Can anyone see anything immediately obvious in the code as to why this might be? Could it be an anti-scraping measure on the site's part? Is it standard practice to slow down the rate of requests to stop this happening?
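For reference on that last point: slowing a Scrapy crawl is done through settings.py. A minimal sketch, assuming the standard DOWNLOAD_DELAY and AutoThrottle settings; the values shown are illustrative placeholders, not tuned recommendations:

# settings.py -- pacing sketch; values below are illustrative placeholders
DOWNLOAD_DELAY = 2                  # base delay (seconds) between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x of the base)
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # one request at a time per domain
AUTOTHROTTLE_ENABLED = True         # adapt the delay to observed server latency
AUTOTHROTTLE_START_DELAY = 5        # initial delay before feedback kicks in
AUTOTHROTTLE_MAX_DELAY = 60         # upper bound on the adaptive delay

AutoThrottle adjusts the delay from the server's observed response times, which tends to be gentler on the target site than a fixed delay alone.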
Thanks
Not sure if this is still of use, but I had to put the following lines in the settings.py file:
HTTPERROR_ALLOWED_CODES = [404]
# USER_AGENT = 'quotesbot (+http://www.yourdomain.com)'
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
Hope this helps.