I want to scrape a web page with an iso-8859-1 charset using Scrapy on Python 2.7. The piece of text I'm interested in on the page is: tempête
Scrapy returns the response in UTF-8 Unicode, with the character correctly encoded:
>>> response
u'temp\xc3\xaate'
Now I want to write the word tempête to a file, so I do the following:
>>> import codecs
>>> file = codecs.open('test', 'a', encoding='utf-8')
>>> file.write(response) #response is the above var
When I open the file, the resulting text is tempÃªte. It seems Python fails to detect the right encoding and reads the two-byte encoded character as two separately encoded single-byte characters.
How should I handle this simple use case?
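For reference, the same behaviour reproduces outside Scrapy (a minimal sketch, assuming a UTF-8 terminal; 'test_repro' is just a throwaway file name):
>>> import codecs
>>> s = u'temp\xc3\xaate'              # the value I get back from Scrapy
>>> f = codecs.open('test_repro', 'w', encoding='utf-8')
>>> f.write(s)
>>> f.close()
>>> open('test_repro').read()          # each code point was UTF-8 encoded a second time
'temp\xc3\x83\xc2\xaate'
>>> print open('test_repro').read()
tempÃªte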
In your sample, response is a (decoded) Unicode string with \xc3\xaa inside, so something went wrong at the Scrapy encoding-detection level.
\xc3\xaa is the character ê encoded as UTF-8, so you should only ever see those bytes inside an (encoded) non-Unicode/str string (in Python 2).
Python 2.7 shell session:
>>> # what your input should look like
>>> tempete = u'tempête'
>>> tempete
u'temp\xeate'
>>> # UTF-8 encoded
>>> tempete.encode('utf-8')
'temp\xc3\xaate'
>>>
>>> # latin1 encoded
>>> tempete.encode('iso-8859-1')
'temp\xeate'
>>>
>>> # back to your sample
>>> s = u'temp\xc3\xaate'
>>> print s
tempÃªte
>>>
>>> # if you use a non-Unicode string with those characters...
>>> s_raw = 'temp\xc3\xaate'
>>> s_raw.decode('utf-8')
u'temp\xeate'
>>>
>>> # ... decoding from UTF-8 works
>>> print s_raw.decode('utf-8')
tempête
>>>
Something went wrong with Scrapy interpreting the page as iso-8859-1 encoded.
You can force the encoding by re-building the response from response.body:
>>> import scrapy.http
>>> hr1 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='latin1')
>>> hr1.body_as_unicode()
u'<html><body>temp\xc3\xaate</body></html>'
>>> hr2 = scrapy.http.HtmlResponse(url='http://www.example', body='<html><body>temp\xc3\xaate</body></html>', encoding='utf-8')
>>> hr2.body_as_unicode()
u'<html><body>temp\xeate</body></html>'
>>>
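Continuing the session above with hr2 (the response decoded as UTF-8), the codecs.open approach from the question now writes the expected character. A minimal sketch, assuming a UTF-8 terminal; 'fixed.txt' is just an example file name:
>>> import codecs
>>> text = hr2.body_as_unicode()        # u'<html><body>temp\xeate</body></html>'
>>> out = codecs.open('fixed.txt', 'w', encoding='utf-8')
>>> out.write(text)
>>> out.close()
>>> print open('fixed.txt').read()
<html><body>tempête</body></html>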
Or build a new response:
newresponse = response.replace(encoding='utf-8')
and work with newresponse instead of the original response.
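As a sketch only (the spider name, XPath and output file are illustrative; it assumes the page bytes really are UTF-8 and a Scrapy version where response.xpath() is available), the whole fix could look like this inside a spider callback:
import codecs
import scrapy

class ExampleSpider(scrapy.Spider):            # illustrative spider
    name = 'example'
    start_urls = ['http://www.example']        # placeholder URL from the session above

    def parse(self, response):
        # Re-decode the raw body with the encoding we trust,
        # instead of the one Scrapy detected.
        newresponse = response.replace(encoding='utf-8')
        word = newresponse.xpath('//body/text()').extract()[0]   # u'temp\xeate'
        # Writing the properly decoded Unicode string now yields "tempête" in the file.
        with codecs.open('words.txt', 'a', encoding='utf-8') as f:
            f.write(word + u'\n')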