I have a problem with quoting in Scrapy's CSV output. My data contains commas, which causes Scrapy to wrap only some columns of the scraped data in double quotes, like this:
TEST,TEST,TEST,ON,TEST,TEST,"$2,449,000, 4,735 Sq Ft, 6 Bed, 5.1 Bath, Listed 03/01/2016"
TEST,TEST,TEST,ON,TEST,TEST,"$2,895,000, 4,975 Sq Ft, 5 Bed, 4.1 Bath, Listed 01/03/2016"
Only the columns containing commas get wrapped in double quotes. How can I have all of my data columns double-quoted?
This is what I would like Scrapy to output:
"TEST","TEST","TEST","ON","TEST","TEST","$2,449,000, 4,735 Sq Ft, 6 Bed, 5.1 Bath, Listed 03/01/2016"
"TEST","TEST","TEST","ON","TEST","TEST","$2,895,000, 4,975 Sq Ft, 5 Bed, 4.1 Bath, Listed 01/03/2016"
Is there a setting I can change for this?
By default, for CSV output, Scrapy uses csv.writer() with its default settings.
For field quoting, that default is csv.QUOTE_MINIMAL:
"Instructs the writer objects to only quote those fields which contain special characters such as delimiter, quotechar or any of the characters in lineterminator."
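For a quick illustration outside of Scrapy, here is a minimal sketch using the plain csv module (the row values are made up) that shows the difference between the two quoting modes:
import csv
import sys

row = ["TEST", "ON", "$2,449,000, 4,735 Sq Ft"]

# QUOTE_MINIMAL (the default): only the field containing the delimiter is quoted.
csv.writer(sys.stdout, quoting=csv.QUOTE_MINIMAL).writerow(row)
# TEST,ON,"$2,449,000, 4,735 Sq Ft"

# QUOTE_ALL: every field is quoted.
csv.writer(sys.stdout, quoting=csv.QUOTE_ALL).writerow(row)
# "TEST","ON","$2,449,000, 4,735 Sq Ft"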
But you can build your own CSV item exporter and set a new dialect on top of the default 'excel' dialect.
For example, in an exporters.py module inside your project package, define the following:
import csv

from scrapy.exporters import CsvItemExporter


class QuoteAllDialect(csv.excel):
    # Same as the default "excel" dialect, but quote every field.
    quoting = csv.QUOTE_ALL


class QuoteAllCsvItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        # Pass the custom dialect down to the underlying csv.writer().
        kwargs.update({'dialect': QuoteAllDialect})
        super(QuoteAllCsvItemExporter, self).__init__(*args, **kwargs)
Then you just need to reference this item exporter in your settings (settings.py) so it is used for CSV output, e.g.:
FEED_EXPORTERS = {
    'csv': 'myproject.exporters.QuoteAllCsvItemExporter',
}
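(The 'csv' key overrides the exporter Scrapy registers by default for the csv feed format, so any CSV export, for example running the spider below with scrapy crawl example -o output.csv, will go through the custom exporter. Adjust the myproject dotted path to your own project package name.)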
And with a simple spider like this one:
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["example.com"]
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {
            "name": "Some name",
            "title": "Some title, baby!",
            "description": "Some description, with commas, quotes (\") and all"
        }
you will get the following output:
"description","name","title"
"Some description, with commas, quotes ("") and all","Some name","Some title, baby!"
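As a side note, the embedded quote character in the description is escaped by doubling it (""), which is how the 'excel' dialect escapes quotes; a reader using the same dialect recovers the original value. A minimal check, with the literal line copied from the output above:
import csv
import io

line = '"Some description, with commas, quotes ("") and all","Some name","Some title, baby!"\r\n'
print(next(csv.reader(io.StringIO(line))))
# ['Some description, with commas, quotes (") and all', 'Some name', 'Some title, baby!']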