SIM*_*SIM 0 csv scrapy web-scraping python-3.x scrapy-spider
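I wrote a very small script in Python Scrapy to parse the names, streets, and phone numbers shown across multiple pages of the Yellow Pages website. The script itself runs smoothly. The only problem is how the scraped data ends up in the CSV output: there is always a blank line (row) between two rows, so the data appears on every other row. The image below shows what I mean. Outside Scrapy I could have used [newline='']. Here, unfortunately, I'm completely stuck. How can I get rid of the blank lines in the CSV output? Thanks in advance for taking a look.
items.py contains: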
import scrapy


class YellowpageItem(scrapy.Item):
    name = scrapy.Field()
    street = scrapy.Field()
    phone = scrapy.Field()
Here is the spider:
import scrapy


class YellowpageSpider(scrapy.Spider):
    name = "YellowpageSp"
    start_urls = ["https://www.yellowpages.com/search?search_terms=Pizza&geo_location_terms=Los%20Angeles%2C%20CA&page={0}".format(page) for page in range(2,6)]

    def parse(self, response):
        for titles in response.css('div.info'):
            name = titles.css('a.business-name span[itemprop=name]::text').extract_first()
            street = titles.css('span.street-address::text').extract_first()
            phone = titles.css('div[itemprop=telephone]::text').extract_first()
            yield {'name': name, 'street': street, 'phone':phone}
Here is what the CSV output looks like:
By the way, the command I use to get the CSV output is:
scrapy crawl YellowpageSp -o items.csv -t csv
You can fix this by creating a new feed exporter. Change your settings.py as follows:
FEED_EXPORTERS = {
    'csv': 'project.exporters.FixLineCsvItemExporter',
}
Create an exporters.py in your project:
import io
import os
import six
import csv
# scrapy.exporters is the current location of the old scrapy.contrib.exporter module
from scrapy.exporters import CsvItemExporter
from scrapy.extensions.feedexport import IFeedStorage
from w3lib.url import file_uri_to_path
from zope.interface import implementer


@implementer(IFeedStorage)
class FixedFileFeedStorage(object):

    def __init__(self, uri):
        self.path = file_uri_to_path(uri)

    def open(self, spider):
        # Create the target directory if needed, then reopen the file in binary append mode
        dirname = os.path.dirname(self.path)
        if dirname and not os.path.exists(dirname):
            os.makedirs(dirname)
        return open(self.path, 'ab')

    def store(self, file):
        file.close()


class FixLineCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(FixLineCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)
        self._configure(kwargs, dont_fail=True)
        # Discard the stream created by the parent class and reopen the same file
        self.stream.close()
        storage = FixedFileFeedStorage(file.name)
        file = storage.open(file.name)
        # newline="" disables newline translation, so csv's own "\r\n" is written unchanged
        self.stream = io.TextIOWrapper(
            file,
            line_buffering=False,
            write_through=True,
            encoding=self.encoding,
            newline="",
        ) if six.PY3 else file
        self.csv_writer = csv.writer(self.stream, **kwargs)
I'm on a Mac, so I can't test how this behaves on Windows. If the code above doesn't work, change the following part of it and set newline="\n":
self.stream = io.TextIOWrapper(
    file,
    line_buffering=False,
    write_through=True,
    encoding=self.encoding,
    newline="\n",
) if six.PY3 else file
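
To see why the newline argument matters, here is a minimal sketch that uses only the standard library, not Scrapy. By default csv.writer ends every row with "\r\n". If the text stream then translates each "\n" into "\r\n" (which is what the platform default does on Windows), the row ends in "\r\r\n", and that typically shows up as an extra blank row. The helper name write_rows and the sample rows are made up for illustration, and newline="\r\n" is passed explicitly to simulate the Windows translation on any platform.

```python
import csv
import io


def write_rows(newline):
    """Write two CSV rows through a text wrapper and return the raw bytes."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(
        raw,
        encoding="utf-8",
        newline=newline,
        write_through=True,
    )
    writer = csv.writer(stream)  # default lineterminator is "\r\n"
    writer.writerow(["name", "phone"])
    writer.writerow(["Pizza Place", "555-0100"])
    stream.flush()
    data = raw.getvalue()
    stream.detach()  # leave the BytesIO alone when the wrapper goes away
    return data


if __name__ == "__main__":
    # "\n" is translated to "\r\n" on write, so each row ends in "\r\r\n"
    print(write_rows("\r\n"))
    # No translation: each row ends in a single "\r\n"
    print(write_rows(""))
    # newline="\n" also disables translation on write
    print(write_rows("\n"))
```

Both newline="" and newline="\n" leave the writer's output untouched, which is why either variant of the exporter above should avoid the doubled carriage return.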