Scraping from a script: won't export data

12R*_*n12 5 scrapy web-scraping python-2.7 twisted.internet web

I am trying to run Scrapy from a script, but I cannot get the program to create the export file.

I have tried to export the file in two different ways:

  1. With a pipeline
  2. With a feed export

Both methods work when I run Scrapy from the command line, but neither works when I run Scrapy from a script.
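
For reference, the command-line invocation that does work is something like the following (the spider name here is a placeholder):

    scrapy crawl myspider -o output.csv -t csv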

I am not the only one with this problem. Here are two other, similar, unanswered questions. I did not notice these until after I had posted my question.

  1. JSON not working in scrapy when calling spider through a python script?
  2. Calling scrapy from a python script does not create JSON output file

Here is my code for running Scrapy from a script. It includes the settings for writing the output file both with the pipeline and with the feed exporter.

from twisted.internet import reactor

from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
import logging

from external_links.spiders.test import MySpider
from scrapy.utils.project import get_project_settings
settings = get_project_settings()

#manually set settings here
settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')

def stop_reactor():
    reactor.stop()

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('reactor running...')
reactor.run()
log.msg('Reactor stopped...')

After I run this code, the log says: "Stored csv feed (341 items) in: output.csv", but output.csv is nowhere to be found.
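
Since the log claims the feed was stored, the file may simply be written relative to a different working directory than expected. A minimal check that can be added to the script above (my own debugging sketch; the absolute URI is an example path):

    import os

    # A relative FEED_URI such as 'output.csv' is resolved against the
    # process's current working directory, so print where that actually is.
    print(os.getcwd())

    # An absolute file URI removes the ambiguity:
    settings.set('FEED_URI', 'file:///tmp/output.csv', priority='cmdline')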

Here is my feed exporter code:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()

#manually set settings here
settings.set('ITEM_PIPELINES',   {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')


from scrapy.contrib.exporter import CsvItemExporter


class CsvOptionRespectingItemExporter(CsvItemExporter):

    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
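
The point of this exporter is to honor a CSV_DELIMITER setting. To produce, say, a semicolon-separated file, the setting would be set alongside the others above (the semicolon value is just an illustration):

    settings.set('CSV_DELIMITER', ';', priority='cmdline')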

Here is my pipeline code:

import csv

class CsvWriterPipeline(object):

    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))

    def process_item(self, item, spider):  # item must be the second parameter; otherwise this argument receives the spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])

        return item
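
The FilterPipeline registered in ITEM_PIPELINES above is not shown; for completeness, here is a hypothetical sketch of what such a pipeline could look like (the dropping rule is invented for illustration):

    from scrapy.exceptions import DropItem

    class FilterPipeline(object):

        def process_item(self, item, spider):
            # Hypothetical rule: drop items that carry no links at all.
            if not item.get('all_links'):
                raise DropItem('item has no links')
            return item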

een*_*agy 3

I had the same problem.

Here is what worked for me:

  1. Put the export URI in your settings.py:

    FEED_URI='file:///tmp/feeds/filename.jsonlines'

  2. Create a scrape.py script next to your scrapy.cfg with the following content:

     
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    
    process = CrawlerProcess(get_project_settings())
    
    process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
    process.start() # the script will block here until the crawling is finished
    
    
  3. Run: python scrape.py

Result: the file is created.
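
If the spider name string is not resolved (for example when the project settings cannot be located), process.crawl also accepts the spider class directly; assuming the MySpider import from the question:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from external_links.spiders.test import MySpider

    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpider)  # pass the spider class instead of its name
    process.start()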

Note: I have no pipelines in my project, so I am not sure whether a pipeline would filter your results.

Also: the common pitfalls section in the docs is what helped me here.
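
For reference, the docs also show a CrawlerRunner-based variant with explicit reactor handling; roughly (a sketch following the Scrapy docs, spider name assumed):

    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.project import get_project_settings

    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl('yourspidername')   # crawl() returns a Deferred
    d.addBoth(lambda _: reactor.stop())  # stop the reactor when the crawl ends
    reactor.run()                        # blocks here until the crawl finishes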