How to override/set cookies in Scrapy

Mah*_*tah 14 python scrapy

I want to scrape http://www.3andena.com/. The site starts out in Arabic and stores the language setting in a cookie. If you try to access the English version directly via its URL (http://www.3andena.com/home.php?sl=en), something goes wrong and the server returns an error.

So, I want to set the cookie "store_language" to the value "en" and only then start scraping the site with that cookie in place.

I'm using a CrawlSpider with several rules.

Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
  name = "andena"
  domain_name = "3andena.com"
  start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

  product_urls = []

  rules = (
     # The following rule is for pagination
     Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$'),), follow=True),
     # The following rule is for product details
     Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
     )

  def start_requests(self):
    yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language':'en'})

    for url in self.start_urls:
        yield Request(url, callback=self.parse_category)


  def parse_category(self, response):
    hxs = HtmlXPathSelector(response)

    self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

    for product in self.product_urls:
        yield Request(product, callback=self.parse_product)  


  def parse_product(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    item = Product()

    '''
    some parsing
    '''

    items.append(item)
    return items

SPIDER = AndenaSpider()

Here is the log:

2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)

小智 11

Modify your code as follows:

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language':'en'}, callback=self.parse_category)

The scrapy.Request object accepts an optional cookies keyword argument; see the Request documentation.
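Under the hood, a cookies dict like the one above ends up serialized into a Cookie request header. A minimal stdlib-only sketch of that serialization (cookie_header is an illustrative helper, not a Scrapy API; Scrapy's own middleware does this for you):

```python
from http.cookies import SimpleCookie

def cookie_header(cookies):
    """Serialize a cookies dict into a Cookie request-header value,
    roughly what happens to the cookies= kwarg on a request."""
    jar = SimpleCookie()
    for name, value in cookies.items():
        jar[name] = value
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())

print(cookie_header({'store_language': 'en'}))  # store_language=en
```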


Lou*_*uis 7

This is how I do it in Scrapy 0.24.6:

from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):

    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request

Scrapy calls make_requests_from_url with the URLs in the spider's start_urls attribute. The code above lets the default implementation create the request and then adds a foo cookie with the value bar (or changes the foo cookie to the value bar if, for whatever reason, it already exists in the request produced by the default implementation).

In case you are wondering what happens to requests that are not created from start_urls, let me add that Scrapy's cookies middleware will remember the cookies set with the code above and set them on all future requests sharing the same domain as the requests to which you explicitly added your cookies.
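That per-domain persistence can be sketched with a stdlib-only stand-in (DomainCookieJar is illustrative, not Scrapy's actual middleware class):

```python
from urllib.parse import urlparse

class DomainCookieJar:
    """Toy model of a cookies middleware: remember cookies per host and
    replay them on every later request to the same host."""

    def __init__(self):
        self._jars = {}  # hostname -> {cookie name: value}

    def set(self, url, name, value):
        self._jars.setdefault(urlparse(url).hostname, {})[name] = value

    def cookies_for(self, url):
        return dict(self._jars.get(urlparse(url).hostname, {}))

jar = DomainCookieJar()
jar.set('http://www.3andena.com/home.php?sl=en', 'store_language', 'en')
# The cookie now applies to every later request on the same host:
print(jar.cookies_for('http://www.3andena.com/Kettles/'))
```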


Ven*_*atH 4

Straight from the Scrapy Requests and Responses documentation:

You need something like this:

request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'})