Formatting text output with Scrapy in Python

use*_*057 4 python text scrapy web-scraping

I'm trying to scrape pages with a Scrapy spider and then save those pages to a .txt file in a readable form. The code I'm using for this is:

def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)

    hxs = HtmlXPathSelector(response)

    title = hxs.select('/html/head/title/text()').extract()
    content = hxs.select('//*[@id="content"]').extract()

    texts = "%s\n\n%s" % (title, content)

    soup = BeautifulSoup(''.join(texts))

    strip = ''.join(BeautifulSoup(pretty).findAll(text=True))

    filename = ("/Users/username/path/output/Hansard-" + '%s' ".txt") % (title)
    filly = open(filename, "w")
    filly.write(strip)

I've brought BeautifulSoup into this because the body contains a lot of HTML that I don't want in the final product (mostly links), so I use BS to strip out the HTML and leave only the text of interest.
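As a side note, newer BeautifulSoup releases (the `bs4` package) expose `get_text()` for exactly this kind of tag stripping; a minimal sketch, with the HTML snippet invented for illustration:

```python
from bs4 import BeautifulSoup

# A made-up fragment in the shape of the scraped content
html = '<div id="content"><a href="#">Dr. King</a> asked the Minister of Education.</div>'

soup = BeautifulSoup(html, "html.parser")
# get_text() concatenates all text nodes, dropping the tags (including links)
print(soup.get_text())
```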

This gives me output that looks like:

[u"School, Chandler's Ford (Hansard, 30 November 1961)"]

[u'

 \n      \n

  HC Deb 30 November 1961 vol 650 cc608-9

 \n

  608

 \n

  \n


  \n

   \n

    \xa7

   \n

    28.

   \n


     Dr. King


   \n

    \n            asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler\'s Ford; and why he refused permission to acquire this site in 1954.\n

   \n

  \n

 \n      \n

  \n


  \n

   \n

    \xa7

   \n


     Sir D. Eccles


   \n

    \n            I understand that the authority has paid \xa375,000 for this site.\n            \n

whereas I'd like the output to look like:

    School, Chandler's Ford (Hansard, 30 November 1961)

          HC Deb 30 November 1961 vol 650 cc608-9

          608

            28.

Dr. King asked the Minister of Education what is the price at which the Hampshire education authority is acquiring the site for the erection of Oakmount Secondary School, Chandler's Ford; and why he refused permission to acquire this site in 1954.

Sir D. Eccles I understand that the authority has paid £75,000 for this site.

So essentially I'm looking for how to remove the newline markers \n, tighten everything up, and convert any special characters to their normal form.

rec*_*dev 8

My answer, with explanations in the code comments:

import re
import codecs

#...
#...
# extract() returns a list, so you need to take the first element
title = hxs.select('/html/head/title/text()').extract()[0]
content = hxs.select('//*[@id="content"]')
# instead of using BeautifulSoup for this task, you can use the following
content = content.select('string()').extract()[0]

# simply collapse duplicated spaces and newlines; you may need to adjust this expression
cleaned_content = re.sub(ur'(\s)\s+', ur'\1', content, flags=re.MULTILINE | re.UNICODE)

texts = "%s\n\n%s" % (title, cleaned_content)

# looks like a typo in the filename creation
#filename ....

# and my preferred way to write a file with an encoding
with codecs.open(filename, 'w', encoding='utf-8') as output:
    output.write(texts)
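To see what that regex does, here is a Python 3 sketch of the same substitution (so plain `r''` literals instead of `ur''`), applied to a sample string invented to mimic the messy output above. Each run of whitespace collapses to its first character, single spaces inside sentences are untouched, and `\xa7` / `\xa3` print as literal § / £ once the text is written out with a proper encoding:

```python
import re

# Invented sample in the shape of the scraped Hansard text
raw = u"\n      \xa7\n    28.\n\n  Dr. King\n    asked the Minister\n"

# (\s)\s+ matches a whitespace char followed by at least one more;
# replacing with \1 keeps only the first char of each run
cleaned = re.sub(r'(\s)\s+', r'\1', raw, flags=re.MULTILINE | re.UNICODE)

print(cleaned)
```

Note that a lone space (as in "Dr. King") does not match the pattern, because `\s+` requires at least one further whitespace character after the captured one.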