Python, Scrapy: wrong UTF-8 characters written to a file from a scraped HTML page with charset iso-8859-1

Pie*_*scy 1 python utf-8 character-encoding scrapy python-2.7

I want to scrape a web page that declares charset iso-8859-1, using Scrapy on Python 2.7. The text I am interested in on the page is: tempête

Scrapy returns the response as UTF-8 unicode, with the characters apparently correctly encoded:

>>> response
u'temp\xc3\xaate'

Now I want to write the word tempête to a file, so I do the following:

>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response) #response is the above var

When I open the file, the resulting text is tempÃªte. It seems Python fails to detect the proper encoding: it reads each two-byte UTF-8 character as two one-byte characters and encodes both of them again.

How do I handle this simple use case?

pau*_*rth 5

In your example, response is a (decoded) unicode string with \xc3\xaa inside it, so something went wrong at Scrapy's encoding-detection level.

\xc3\xaa is the character ê encoded as UTF-8, so you should only see those bytes inside an (encoded) non-unicode / str string (in Python 2).

Python 2.7 shell session:

>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'

>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>> 

>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempête
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>> 
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>> 
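The mojibake the asker saw in the file follows directly from this. A minimal sketch of the assumed failure mode, reproduced without Scrapy (runs under Python 2.7 and 3):

```python
# The page bytes were UTF-8, but were decoded as latin-1, so each byte
# of the two-byte sequence for e-circumflex became its own code point.
raw = b'temp\xc3\xaate'                # correct UTF-8 bytes for the word
mis_decoded = raw.decode('latin-1')    # u'temp\xc3\xaate' -- what the response held
written = mis_decoded.encode('utf-8')  # what codecs.open() then wrote to disk

print(repr(written))            # b'temp\xc3\x83\xc2\xaate' -- UTF-8 applied twice
print(written.decode('utf-8'))  # the mojibake seen when opening the file
```

Encoding the mis-decoded string to UTF-8 a second time is exactly what produces the four-byte sequence \xc3\x83\xc2\xaa on disk.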

Something went wrong when Scrapy interpreted the page as iso-8859-1 encoded.

You can force the encoding by re-building a response from response.body:

>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>> 

Or build a new response with:

newresponse = response.replace(encoding='utf-8')

and work with newresponse instead of response.
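If you already hold the mis-decoded unicode string (for example, text extracted before fixing the response), the damage can also be undone directly. This is a sketch, not part of the answer above: re-encode with the wrong codec to recover the original bytes, then decode them as UTF-8.

```python
# Undo a wrong latin-1 decode: latin-1 maps code points 0-255 back to
# the identical bytes, so the original UTF-8 byte string is recovered.
bad = u'temp\xc3\xaate'                        # mis-decoded response text
fixed = bad.encode('latin-1').decode('utf-8')  # u'temp\xeate'
print(fixed)
```

This round-trip only works because every code point in the mis-decoded string is below 256; fixing the response encoding at the source, as shown above, is the cleaner solution.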