Ami*_*ini 1 python encoding scrapy python-2.7 scrapy-spider
When I try to construct an HtmlResponse object in Scrapy like this:
scrapy.http.HtmlResponse(url=self.base_url + dealer_url[0], body=dealer_html)
I get this error:
Traceback (most recent call last):
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\Kerja\HIT\Python Projects\<project_name>\<project_name>\<project_name>\<project_name>\spiders\fwi.py", line 69, in parse_items
dealer_page = scrapy.http.HtmlResponse(url=self.base_url + dealer_url[0], body=dealer_html)
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\response\text.py", line 27, in __init__
super(TextResponse, self).__init__(*args, **kwargs)
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\response\__init__.py", line 18, in __init__
self._set_body(body)
File "d:\kerja\hit\python~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\response\text.py", line 43, in _set_body
type(self).__name__)
TypeError: Cannot convert unicode body - HtmlResponse has no encoding
Does anyone know how to fix this error?
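HtmlResponse is trying to detect the encoding. From the Scrapy documentation:
"The HtmlResponse class is a subclass of TextResponse which adds encoding auto-discovering support by looking into the HTML meta http-equiv attribute. See TextResponse.encoding."
So basically the HTML string you pass as the body argument (dealer_html in your case) comes with no encoding, and as the error message says, HtmlResponse cannot convert a unicode body without one. According to the W3 documentation, an encoding declaration in the HTML should look like this: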
HTML 4.01: <meta http-equiv="content-type" content="text/html; charset=UTF-8">
HTML5: <meta charset="UTF-8">
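To check whether a page declares its encoding in either of these two forms, something like the following could be used. This is only a rough sketch and not how Scrapy itself detects encodings: the `declared_charset` helper and its regular expression are made up for illustration, and it expects the raw page as bytes.

```python
import re

# Simplified pattern covering both declaration styles:
#   HTML 4.01: <meta http-equiv="content-type" content="text/html; charset=UTF-8">
#   HTML5:     <meta charset="UTF-8">
# It is deliberately naive and does not handle every edge case.
_META_CHARSET = re.compile(
    br'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def declared_charset(html_bytes):
    """Return the lowercased charset named in a <meta> tag, or None.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    match = _META_CHARSET.search(html_bytes)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


html4 = b'<meta http-equiv="content-type" content="text/html; charset=UTF-8">'
html5 = b'<meta charset="UTF-8">'
print(declared_charset(html4))
print(declared_charset(html5))
print(declared_charset(b"<html><body>no declaration</body></html>"))
```

If this returns None for your page, there is nothing in the markup to discover, so the encoding has to be supplied some other way.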
In that case, you can either fix the HTML or specify the encoding through the encoding argument when creating the HtmlResponse object:
HtmlResponse(url='http://scrapy.org', body=u'some body', encoding='utf-8')
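Applied to the spider in the question, one possible approach is to normalise the body before building the response. The sketch below assumes UTF-8 is an acceptable encoding for the page; `response_kwargs` is a hypothetical helper, not a Scrapy function, and the Scrapy import is guarded so the snippet still runs where Scrapy is not installed.

```python
def response_kwargs(url, html, encoding="utf-8"):
    """Build keyword arguments for HtmlResponse (hypothetical helper).

    The default encoding of UTF-8 is an assumption; use whatever the
    page was actually decoded from.
    """
    if isinstance(html, bytes):
        # Byte bodies are passed through unchanged, with no encoding declared.
        return {"url": url, "body": html}
    # Text (unicode) bodies are encoded up front and the encoding is declared,
    # which avoids the "has no encoding" TypeError.
    return {"url": url, "body": html.encode(encoding), "encoding": encoding}


kwargs = response_kwargs("http://scrapy.org", u"<html><body>caf\xe9</body></html>")

try:
    from scrapy.http import HtmlResponse
except ImportError:
    HtmlResponse = None  # Scrapy is not installed; only the helper is exercised.

if HtmlResponse is not None:
    response = HtmlResponse(**kwargs)
    print(response.encoding)
```

In the question's code this would amount to calling the helper with self.base_url + dealer_url[0] and dealer_html, then passing the result to scrapy.http.HtmlResponse.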