Scrapy: converting an HTML string to an HtmlResponse object

Question

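I have a raw HTML string that I want to convert into a Scrapy HTML response object, so that I can use the css and xpath selectors on it, just as with Scrapy's response. How can I do this?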

Answer 1

First of all, if this is for debugging or testing purposes, you can use the Scrapy shell:

$ cat index.html
<div id="test">
    Test text
</div>

$ scrapy shell index.html
>>> response.xpath('//div[@id="test"]/text()').get().strip()
'Test text'

Several objects are available in the shell during the session, such as response and request.

Alternatively, you can instantiate the HtmlResponse class and pass the HTML string in body:

>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="my HTML string", body='<div id="test">Test text</div>', encoding='utf-8')
>>> response.xpath('//div[@id="test"]/text()').get().strip()
'Test text'

Edit:

You may want the Scrapy shell.

As of today, the HtmlResponse object requires one more argument, encoding. You can do it like this: HtmlResponse(url='http://scrapy.org', body=u'some body', encoding='utf-8') (6 upvotes)

Answer 2

小智 14

alecxe's answer is right, but this is the correct way to instantiate a Selector from text in Scrapy:

>>> from scrapy.selector import Selector
>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').get()
'good'

Archived:	10 years, 10 months ago
Views:	8480
Last activity:	5 years, 11 months ago