Scrapy: save response.body as an HTML file?

Question

bon*_*low 8 python django web-crawler scrapy

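My spider works, but I can't save the body of the page I crawl to an .html file. If I write self.html_file.write('test'), it works fine. I don't know how to convert the tuple into a string.

I'm using Python 3.6.

Spider: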

class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'
        self.html_file = open(self.path_to_html, 'w')

    def parse(self, response):
        url = response.url
        self.html_file.write(response.body)
        self.html_file.close()
        yield {
            'url': url
        }

Traceback:

Traceback (most recent call last):
  File "c:\python\python36-32\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\Users\kv\AtomProjects\example_project\example_bot\example_bot\spiders\example.py", line 35, in parse
    self.html_file.write(response.body)
TypeError: write() argument must be str, not bytes

Answer 1

Som*_*mil 11

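The actual problem is that response.body is bytes, while a file opened in text mode ('w') only accepts str. You need to convert the bytes to a string first. There are several ways to do that. You can use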

 self.html_file.write(response.body.decode("utf-8"))

instead of

  self.html_file.write(response.body)

Comment (7 upvotes): I'd suggest using `response.text`, which is already Unicode, rather than `response.body.decode("utf-8")`, because the encoding may not be UTF-8.
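
To illustrate the point in the comment, here is a minimal sketch using only the standard library. The sample bytes and the `decode_body` helper are made up for illustration; they stand in for a response body whose real encoding is Latin-1 rather than UTF-8.

```python
# Hypothetical body bytes, encoded as Latin-1 (an assumption for this example).
body = "café".encode("latin-1")


def decode_body(data: bytes, encoding: str) -> str:
    """Decode raw bytes with the given encoding (illustrative helper)."""
    return data.decode(encoding)


# Hard-coding UTF-8 fails here, because 0xE9 is not valid UTF-8 on its own.
try:
    decode_body(body, "utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding with the encoding the bytes were actually written in succeeds.
text = decode_body(body, "latin-1")
```

So a hard-coded `decode("utf-8")` only works when the page really is UTF-8.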

Answer 2

nir*_*msu 6

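The correct way is to use response.text rather than response.body.decode("utf-8"). Quoting the documentation:

Keep in mind that Response.body is always a bytes object. If you want the unicode version use TextResponse.text (only available in TextResponse and subclasses).

and

text: Response body, as unicode.

The same as response.body.decode(response.encoding), but the result is cached after the first call, so you can access response.text multiple times without extra overhead.

Note: unicode(response.body) is not a correct way to convert the response body to unicode: you would be using the system default encoding (typically ascii) instead of the response encoding.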
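
The behaviour described in the quoted documentation can be sketched roughly as follows. `FakeTextResponse` is a hypothetical stand-in written for this example, not Scrapy's actual class; it only mimics the "decode with the response encoding, then cache" idea. The last line shows the Python 3 counterpart of the pitfall in the note: `str()` on bytes does not decode them.

```python
class FakeTextResponse:
    """Hypothetical stand-in for a text response (not Scrapy's implementation)."""

    def __init__(self, body: bytes, encoding: str):
        self.body = body
        self.encoding = encoding
        self._cached_text = None
        self.decode_calls = 0  # counts how often the body is really decoded

    @property
    def text(self) -> str:
        # Decode once with the response's own encoding, then reuse the result.
        if self._cached_text is None:
            self.decode_calls += 1
            self._cached_text = self.body.decode(self.encoding)
        return self._cached_text


response = FakeTextResponse("<p>naïve</p>".encode("cp1252"), "cp1252")
first = response.text
second = response.text  # served from the cache, no second decode

# In Python 3, str() on bytes gives the repr ("b'...'"), not the decoded text.
as_repr = str(response.body)
```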

Answer 3

Mar*_*uiz 5

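Taking the answers above into account, and making the code as Pythonic as possible by using a with statement, the example should be rewritten as: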

import scrapy

# html_path and header_path are assumed to be defined elsewhere, as in the question.
class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ['google.com']
    start_urls = ['http://google.com/']

    def __init__(self):
        self.path_to_html = html_path + 'index.html'
        self.path_to_header = header_path + 'index.html'

    def parse(self, response):
        with open(self.path_to_html, 'w') as html_file:
            html_file.write(response.text)
        yield {
            'url': response.url
        }

Note that html_file will then only be accessible inside the parse method.
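
If all you want is an exact copy of the page on disk, another option is to skip decoding and write the bytes in binary mode. This is a minimal sketch, not something proposed in the answers above: `save_body` is a hypothetical helper, and the sample bytes and temporary path are made up so the snippet can run without Scrapy.

```python
import os
import tempfile


def save_body(path: str, body: bytes) -> None:
    """Write raw bytes unchanged; 'wb' mode accepts bytes, so no decode is needed."""
    with open(path, "wb") as html_file:
        html_file.write(body)


# Stand-in for response.body (an assumption for this example).
body = "<html><body>héllo</body></html>".encode("utf-8")

# Write to a temporary directory so the example leaves the project tree alone.
path = os.path.join(tempfile.mkdtemp(), "index.html")
save_body(path, body)

# Read the file back to confirm the bytes were stored as-is.
with open(path, "rb") as saved_file:
    saved = saved_file.read()
```

Inside the spider, parse could then call something like save_body(self.path_to_html, response.body). Binary mode also avoids re-encoding the text with the platform's default encoding, which on Windows is often not UTF-8.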

Archived:	8 years, 11 months ago
Views:	9338
Last activity:	7 years, 2 months ago