How do I get UTF-8-encoded Unicode output from Scrapy?

Cal*_*laf 4 scrapy

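Bear with me. I am spelling out every detail because many parts of the toolchain do not handle Unicode gracefully, and it is not clear which part is failing.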

PRELUDE

We first set up and activate a recent Scrapy.

source ~/.scrapy_1.1.2/bin/activate

Because the terminal defaults to ASCII rather than Unicode, we set:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8

In addition, because Python uses ASCII by default, we change the I/O encoding:

export PYTHONIOENCODING="utf_8"
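To confirm that the variable actually reaches the interpreter, a quick check along these lines may help. This is only a sketch: it assumes Python 3 and launches a child interpreter through `sys.executable`, neither of which is part of the original setup.

```python
import codecs
import os
import subprocess
import sys

# Copy the current environment and add the variable exported above.
env = dict(os.environ, PYTHONIOENCODING="utf_8")

# Ask a child interpreter which encoding its stdout ended up with.
child_code = "import sys; print(sys.stdout.encoding)"
reported = subprocess.check_output(
    [sys.executable, "-c", child_code], env=env
).decode("ascii").strip()

# Normalize spelling variants such as "utf_8" and "UTF-8" to one codec name.
normalized = codecs.lookup(reported).name
print(reported, normalized)
```

If the normalized name is not `utf-8`, the export probably did not take effect in the shell that launches Scrapy.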

Now we are ready to start a Scrapy project.

scrapy startproject myproject
cd myproject
scrapy genspider dorf PLACEHOLDER

We are told that we now have a spider.

Created spider 'dorf' using template 'basic' in module:
  myproject.spiders.dorf

We change myproject/items.py to:

# -*- coding: utf-8 -*-
import scrapy

class MyprojectItem(scrapy.Item):
    title = scrapy.Field()

ATTEMPT 1

Now we write the spider, relying on urllib.unquote:

# -*- coding: utf-8 -*-
import scrapy
import urllib
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'en.sistercity.info']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = urllib.unquote(
            response.xpath('//title').extract_first().encode('ascii')
        ).decode('utf8')
        return item

Finally, we use a custom item exporter (dating from October 2011)

# -*- coding: utf-8 -*-
import json
from scrapy.exporters import BaseItemExporter

class UnicodeJsonLinesItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs)
        self.file = file
        self.encoder = json.JSONEncoder(ensure_ascii=False, **kwargs)

    def export_item(self, item):
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict) + '\n')

and add

FEED_EXPORTERS = {
    'json': 'myproject.exporters.UnicodeJsonLinesItemExporter',
}

to myproject/settings.py.

Now we run

~/myproject> scrapy crawl dorf -o dorf.json -t json

and we get

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 25: ordinal not in range(128)
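One plausible reading of this error is that it comes from the `.encode('ascii')` call in the spider rather than from the exporter: position 25 is exactly where the `ü` sits in the extracted `<title>` element. The sketch below reproduces that in isolation. It assumes Python 3 and takes the title text from the JSON output shown in Attempt 2, so it illustrates the failure mode and is not a trace of the actual run.

```python
# Title element as it appears in the Attempt 2 output (assumed to be the
# same text that response.xpath('//title').extract_first() returns).
title = "<title>Sister cities of Düsseldorf — sistercity.info</title>"

error = None
try:
    # ASCII cannot represent "ü", so this raises UnicodeEncodeError.
    title.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    # exc.start is the index of the first character that could not be encoded.
    print(exc.start, repr(exc.object[exc.start]))
```

If that reading is right, dropping the `encode`/`unquote`/`decode` round trip, as Attempt 2 does, should avoid the exception.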

ATTEMPT 2

Another solution (a candidate solution for Scrapy 1.2?) is to use the spider

# -*- coding: utf-8 -*-
import scrapy
from myproject.items import MyprojectItem

class DorfSpider(scrapy.Spider):
    name = "dorf"
    allowed_domains = [u'en.sistercity.info']
    start_urls = (
        u'http://en.sistercity.info/sister-cities/Düsseldorf.html',
    )

    def parse(self, response):
        item = MyprojectItem()
        item['title'] = response.xpath('//title')[0].extract()
        return item

with the custom item exporter

# -*- coding: utf-8 -*-
from scrapy.exporters import JsonItemExporter

class Utf8JsonItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        super(Utf8JsonItemExporter, self).__init__(
            file, ensure_ascii=False, **kwargs)

FEED_EXPORTERS = {
    'json': 'myproject.exporters.Utf8JsonItemExporter',
}

The FEED_EXPORTERS setting above goes in myproject/settings.py.

We get the following JSON file.

[
{"title": "<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"}
]

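The non-ASCII characters are written as \uXXXX escape sequences, not as UTF-8-encoded characters. While this is a trivial problem for a few characters, it becomes a serious one when the entire output is in a foreign language.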
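For context, the escaped file is still valid JSON and decodes to the same text, so the issue is readability of the file rather than data loss. A minimal sketch with the standard `json` module, assuming Python 3; the two sample lines are copied from the output above, once escaped and once with the characters written out:

```python
import json

# The line as Scrapy wrote it, with \uXXXX escapes (raw string keeps the backslashes).
escaped_line = r'{"title": "<title>Sister cities of D\u00fcsseldorf \u2014 sistercity.info</title>"}'

# The same line with the characters written out directly.
raw_line = '{"title": "<title>Sister cities of Düsseldorf — sistercity.info</title>"}'

decoded_escaped = json.loads(escaped_line)
decoded_raw = json.loads(raw_line)

# Both forms decode to the same Python dictionary.
print(decoded_escaped == decoded_raw)
```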

How can I get the output as UTF-8-encoded Unicode?

Mik*_*bov 13

Scrapy 1.2+ has a FEED_EXPORT_ENCODING option. Setting FEED_EXPORT_ENCODING = "utf-8" turns off the escaping of non-ASCII characters in JSON output.
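As a rough illustration, the setting is a single line in myproject/settings.py. The rest of the sketch below uses only the standard `json` module, on Python 3, to show the kind of difference the setting makes. The sample item is taken from the question, and the `ensure_ascii` calls are an analogy for the effect, not a description of Scrapy's internals.

```python
import json

# In myproject/settings.py (Scrapy 1.2+):
FEED_EXPORT_ENCODING = "utf-8"

# Standard-library analogy of the effect on JSON output.
item = {"title": "Sister cities of Düsseldorf — sistercity.info"}

escaped = json.dumps(item)                        # non-ASCII becomes \uXXXX
unescaped = json.dumps(item, ensure_ascii=False)  # characters kept as-is

# What would land in the file when written with the configured encoding.
utf8_bytes = unescaped.encode(FEED_EXPORT_ENCODING)

print(escaped)
print(unescaped)
```

With the setting in place, the command from the question (`scrapy crawl dorf -o dorf.json`) should no longer need a custom exporter or a FEED_EXPORTERS override.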