How do I send cookies with scrapy CrawlSpider requests?

Par*_*eog 16 python cookies scrapy web-scraping

I'm trying to create this Reddit scraper using Python's Scrapy framework.

I've used CrawlSpider to crawl Reddit and its subreddits, but when I come across a page that contains adult content, the site asks for the cookie over18=1.

So I've been trying to send that cookie with every request the spider makes, but it isn't working.

Here is my spider code. As you can see, I tried to add a cookie to every spider request using the start_requests() method.

Could anyone here tell me how to do this, or what I'm doing wrong?

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from reddit.items import RedditItem

class MySpider(CrawlSpider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    rules = (
        Rule(LinkExtractor(
            allow=[r'/r/nsfw/\?count=\d*&after=\w*']),
            callback='parse_item',
            follow=True),
    )

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            print(url)
            yield Request(url, cookies={'over18': '1'}, callback=self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = RedditItem()
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item

esf*_*sfy 16

OK, try doing something like this:

def start_requests(self):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36'}
    for i, url in enumerate(self.start_urls):
        yield Request(url, cookies={'over18': '1'}, callback=self.parse_item, headers=headers)

It's your User-Agent that's getting you blocked.
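As an aside, if the User-Agent is what's blocking you, you can also set it once for the whole project instead of on every request. A minimal sketch of the relevant settings.py line, using Scrapy's documented USER_AGENT setting:

# settings.py -- applied to every request the project makes
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36')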

Edit:

Not sure what's wrong with CrawlSpider, but a plain Spider works anyway:

#!/usr/bin/env python
# encoding: utf-8
import scrapy


class MySpider(scrapy.Spider):
    name = 'redditscraper'
    allowed_domains = ['reddit.com', 'imgur.com']
    start_urls = ['https://www.reddit.com/r/nsfw']

    def request(self, url, callback):
        """
        Wrapper for scrapy.Request that attaches the over18 cookie
        and a browser User-Agent to every request.
        """
        request = scrapy.Request(url=url, callback=callback)
        request.cookies['over18'] = '1'
        request.headers['User-Agent'] = (
            'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, '
            'like Gecko) Chrome/45.0.2454.85 Safari/537.36')
        return request

    def start_requests(self):
        for i, url in enumerate(self.start_urls):
            yield self.request(url, self.parse_item)

    def parse_item(self, response):
        titleList = response.css('a.title')

        for title in titleList:
            item = {}
            item['url'] = title.xpath('@href').extract()
            item['title'] = title.xpath('text()').extract()
            yield item
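        # follow the "next" page link through the wrapper so the
        # over18 cookie and User-Agent are re-attached on every page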
        url = response.xpath('//a[@rel="nofollow next"]/@href').extract_first()
        if url:
            yield self.request(url, self.parse_item)
        # you may consider scrapy.pipelines.images.ImagesPipeline :D
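On the ImagesPipeline hint in the final comment: enabling it is mostly configuration. A minimal settings.py sketch, assuming the built-in pipeline from scrapy.pipelines.images (the IMAGES_STORE path is just an example, and Pillow must be installed):

# settings.py -- turn on Scrapy's built-in image downloading pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/path/to/images'  # example path; any writable directory works

The pipeline reads download URLs from each item's image_urls field and writes the results back into its images field, so the item dict above would need an image_urls key.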


CTD*_*CTD 5

From the Scrapy docs:

1. Using a dict:

request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

2. Using a list of dicts:

request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                         'value': 'USD',
                                         'domain': 'example.com',
                                         'path': '/currency'}])
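If the cookie still doesn't seem to reach the site, Scrapy can log every cookie it sends and receives. A minimal sketch using the documented cookie middleware settings:

# settings.py -- trace cookies to confirm over18=1 is actually being sent
COOKIES_ENABLED = True  # default; keeps the cookie middleware active
COOKIES_DEBUG = True    # log Cookie request headers and Set-Cookie response headers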