Python, Scrapy: wrong UTF-8 characters written to a file from a scraped HTML page with charset iso-8859-1

Pie*_*scy 1 python utf-8 character-encoding scrapy python-2.7

I want to scrape a web page that declares charset iso-8859-1, using Scrapy on Python 2.7. The text I am interested in on the page is: tempête

Scrapy returns the response as UTF-8 unicode, with the characters apparently correctly encoded:

>>> response
u'temp\xc3\xaate'

Now I want to write the word tempête to a file, so I do the following:

>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response) #response is the above var

When I open the file, the resulting text is tempÃªte. It seems Python fails to detect the proper encoding: it reads each two-byte UTF-8 character as two one-byte characters and encodes both of them again.

How do I handle this simple use case?

pau*_*rth 5

In your example, response is a (decoded) unicode string with \xc3\xaa inside it, so something went wrong at Scrapy's encoding-detection level.

\xc3\xaa is the character ê encoded as UTF-8, so you should only see those bytes inside an (encoded) non-unicode / str string (in Python 2).

Python 2.7 shell session:

>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'

>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>> 

>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempête
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>> 
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>> 
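The mojibake the asker saw in the file follows directly from this. A minimal sketch of the assumed failure mode, reproduced without Scrapy (runs under Python 2.7 and 3):

```python
# The page bytes were UTF-8, but were decoded as latin-1, so each byte
# of the two-byte sequence for e-circumflex became its own code point.
raw = b'temp\xc3\xaate'                # correct UTF-8 bytes for the word
mis_decoded = raw.decode('latin-1')    # u'temp\xc3\xaate' -- what the response held
written = mis_decoded.encode('utf-8')  # what codecs.open() then wrote to disk

print(repr(written))            # b'temp\xc3\x83\xc2\xaate' -- UTF-8 applied twice
print(written.decode('utf-8'))  # the mojibake seen when opening the file
```

Encoding the mis-decoded string to UTF-8 a second time is exactly what produces the four-byte sequence \xc3\x83\xc2\xaa on disk.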

Something went wrong when Scrapy interpreted the page as iso-8859-1 encoded.

You can force the encoding by re-building a response from response.body:

>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>> 

Or build a new response with:

newresponse = response.replace(encoding='utf-8')

and work with newresponse instead of response.
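If you already hold the mis-decoded unicode string (for example, text extracted before fixing the response), the damage can also be undone directly. This is a sketch, not part of the answer above: re-encode with the wrong codec to recover the original bytes, then decode them as UTF-8.

```python
# Undo a wrong latin-1 decode: latin-1 maps code points 0-255 back to
# the identical bytes, so the original UTF-8 byte string is recovered.
bad = u'temp\xc3\xaate'                        # mis-decoded response text
fixed = bad.encode('latin-1').decode('utf-8')  # u'temp\xeate'
print(fixed)
```

This round-trip only works because every code point in the mis-decoded string is below 256; fixing the response encoding at the source, as shown above, is the cleaner solution.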