scrapy - a separate output file for each start_url

Niv*_*ius 2 python scrapy web-scraping python-3.x scrapy-spider

I have a spider that works well:

# -*- coding: utf-8 -*-
import scrapy


class AllCategoriesSpider(scrapy.Spider):
    name = 'vieles'
    allowed_domains = ['examplewiki.de']
    start_urls = [
        'http://www.exampleregelwiki.de/index.php/categoryA.html',
        'http://www.exampleregelwiki.de/index.php/categoryB.html',
        'http://www.exampleregelwiki.de/index.php/categoryC.html',
    ]

    def parse(self, response):
        urls = response.css('a.ulSubMenu::attr(href)').extract()  # links to the subpages
        for url in urls:
            url = response.urljoin(url)
            yield scrapy.Request(url=url, callback=self.parse_details)

    def parse_details(self, response):
        yield {
            "Titel": response.css("li.active.last::text").extract(),
            "Content": response.css('div.ce_text.first.last.block').extract(),
        }

Running it with

scrapy runspider spider.py -o dat.json

saves all the information to dat.json.

I would like one output file per start url: categoryA.json, categoryB.json, and so on.

A similar question went unanswered; I could not reproduce its answer and could not learn from the suggestions there.

How can I get multiple output files, one per start_url? I only want to run a single command/shell script/file to achieve this.

fur*_*ras 5

You did not use real urls in your code, so I used my own page for testing.
I had to change the css selectors, and I used different fields.

I save it as csv because it is easier to append data.
With JSON you would have to read all items from the file, add the new item and save everything back to the same file.
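
If per-category JSON output is still preferred, a minimal sketch (my own addition, not part of the original answer) could append one JSON object per line (the JSON Lines format) instead, which also avoids re-reading the file; the class name and the .jl extension are my choices:

import json

class CategoryJsonLinesPipeline(object):

    def process_item(self, item, spider):

        # hypothetical alternative: one JSON object per line (JSON Lines)
        filename = item['Category'] + '.jl'

        # appending a single line does not require reading the existing file
        with open(filename, 'a') as f:
            f.write(json.dumps(dict(item)) + '\n')

        return item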


I created an extra field Category, which is later used as the filename in the pipeline.

items.py

import scrapy

class CategoryItem(scrapy.Item):
    Title = scrapy.Field()
    Date = scrapy.Field()
    # extra field, used later as the filename
    Category = scrapy.Field()

In the spider I get the category from the url and send it to parse_details using meta in the Request.
In parse_details I add Category to the item.

spiders/example.py

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['blog.furas.pl']
    start_urls = ['http://blog.furas.pl/category/python.html','http://blog.furas.pl/category/html.html','http://blog.furas.pl/category/linux.html']

    def parse(self, response):

        # get category from url
        category = response.url.split('/')[-1][:-5]

        urls = response.css('article a::attr(href)').extract()  # links to the subpages

        for url in urls:
            # skip some urls
            if ('/tag/' not in url) and ('/category/' not in url):
                url = response.urljoin(url)
                # add category (as meta) to send it to callback function
                yield scrapy.Request(url=url, callback=self.parse_details, meta={'category': category})

    def parse_details(self, response):

        # get category
        category = response.meta['category']

        # get only first title (or empty string '') and strip it
        title = response.css('h1.entry-title a::text').extract_first('')
        title = title.strip()

        # get only first date (or empty string '') and strip it
        date = response.css('.published::text').extract_first('')
        date = date.strip()

        yield {
            'Title': title,
            'Date': date,
            'Category': category,
        }

In the pipeline I get the Category and use it to open a file for appending and save the item.

pipelines.py

import csv

class CategoryPipeline(object):

    def process_item(self, item, spider):

        # get category and use it as filename
        filename = item['Category'] + '.csv'

        # open file for appending
        with open(filename, 'a') as f:
            writer = csv.writer(f)

            # write only selected elements 
            row = [item['Title'], item['Date']]
            writer.writerow(row)

            # write all data in one row
            # warning: item is a dict, so item.values() may not always return the values in the same order
            #writer.writerow(item.values())

        return item

In settings I had to uncomment ITEM_PIPELINES to activate the pipeline.

settings.py

ITEM_PIPELINES = {
    'category.pipelines.CategoryPipeline': 300,
}
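
With the pipeline enabled, the spider is run as usual from inside the project (assuming the project is named category, as in the pipeline path above):

scrapy crawl example

Each item is then appended to python.csv, html.csv or linux.csv, one file per start url.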

Full code on GitHub: python-examples/scrapy/save-categories-in-separated-files


BTW: I think you could also do this directly in parse_details.
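
A minimal sketch of that idea (my own reading of the hint, not code from the answer): write the row inside the callback itself and skip the pipeline, reusing the category passed via meta. Only parse_details changes; the rest of the spider stays as above.

    # drop-in replacement for parse_details above (requires "import csv" at the
    # top of spiders/example.py); everything else in the spider stays the same
    def parse_details(self, response):
        # category was passed from parse() via meta
        category = response.meta['category']

        # same selectors as above
        title = response.css('h1.entry-title a::text').extract_first('').strip()
        date = response.css('.published::text').extract_first('').strip()

        # append the row directly here instead of going through a pipeline
        with open(category + '.csv', 'a') as f:
            csv.writer(f).writerow([title, date])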