I am trying to output to CSV, but I realised that when scraping TripAdvisor I pick up a lot of carriage returns, so the array ends up with more than 30 entries even though there are only 10 reviews, and as a result a lot of fields are missing. Is there a way to remove the carriage returns?

The spider:
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapingtest.items import ScrapingTestingItem
from collections import OrderedDict
import json
from scrapy.selector.lxmlsel import HtmlXPathSelector
import csv
import html2text
import unicodedata


class scrapingtestspider(Spider):
    name = "scrapytesting"
    allowed_domains = ["tripadvisor.in"]
    base_uri = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]

    def parse(self, response):
        item = ScrapingTestingItem()
        sel = HtmlXPathSelector(response)
        converter = html2text.HTML2Text()
        sites = sel.xpath('//a[contains(text(), "Next")]/@href').extract()
        ## dummy_test = [ "" for k in range(10)]
        item['reviews'] = sel.xpath('//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract()
        item['subjects'] = sel.xpath('//span[@class="noQuotes"]/text()').extract()
        item['stars'] = sel.xpath('//*[@class="rating reviewItemInline"]//img/@alt').extract()
        item['names'] = sel.xpath('//*[@class="username mo"]/span/text()').extract()
        item['location'] = sel.xpath('//*[@class="location"]/text()').extract()
        item['date'] = sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract()
        item['date'] += sel.xpath('//div[@class="col2of2"]//span[@class="ratingDate"]/text()').extract()

        startingrange = len(sel.xpath('//*[@class="ratingDate relativeDate"]/@title').extract())
        for j in range(startingrange, len(item['date'])):
            item['date'][j] = item['date'][j][9:].strip()
        for i in range(len(item['stars'])):
            item['stars'][i] = item['stars'][i][:1].strip()
        for o in range(len(item['reviews'])):
            print unicodedata.normalize('NFKD', unicode(item['reviews'][o])).encode('ascii', 'ignore')
        for y in range(len(item['subjects'])):
            item['subjects'][y] = unicodedata.normalize('NFKD', unicode(item['subjects'][y])).encode('ascii', 'ignore')

        yield item
        # print item['reviews']

        if(sites and len(sites) > 0):
            for site in sites:
                yield Request(url="http://tripadvisor.in" + site, callback=self.parse)
Is it possible to go over the fields with a regex in a for loop and replace these characters? I tried replace() but it did not do anything. Also, why does Scrapy return them in the first place?
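For the record, a minimal sketch of the direct approach asked about here, assuming each field is the list of unicode strings returned by extract(): collapse whitespace with a regex and drop entries that are nothing but newlines before assigning them to the item (the sample strings below are made up for illustration).

import re

def clean(values):
    # Collapse runs of whitespace (including \r\n) to a single space,
    # strip the ends, and drop entries that were only whitespace.
    return [re.sub(r'\s+', ' ', v).strip() for v in values if v.strip()]

# e.g. inside parse(), instead of assigning the raw extract() result:
#   item['reviews'] = clean(sel.xpath(
#       '//div[@class="col2of2"]//p[@class="partial_entry"]/text()').extract())

print clean([u"Great location \r\n", u"\n", u"  Friendly staff\n"])
# prints the two cleaned strings; the whitespace-only entry is gone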
What I usually do to trim and clean up the output is to use input and/or output processors with an Item Loader - it keeps things more modular and clean:
class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)
    default_output_processor = TakeFirst()
Then, if you load your item with this Item Loader, the extracted values come out stripped and as strings rather than lists. For example, if an extracted field is ["my value \n"], you will get "my value" as the output.
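A sketch of how that loader might be wired into the spider from the question - the import paths assume Scrapy 1.x with Python 2 (matching the original code), the spider name is made up, and only a few of the fields are shown:

from scrapy.spiders import Spider
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

from scrapingtest.items import ScrapingTestingItem


class ScrapingTestingLoader(ItemLoader):
    default_input_processor = MapCompose(unicode.strip)   # strip every extracted string
    default_output_processor = TakeFirst()                # return a string, not a list


class CleanedTripAdvisorSpider(Spider):
    name = "scrapytesting_cleaned"
    allowed_domains = ["tripadvisor.in"]
    start_urls = [
        "http://www.tripadvisor.in/Hotel_Review-g297679-d736080-Reviews-Ooty_Elk_Hill_A_Sterling_Holidays_Resort-Ooty_Tamil_Nadu.html"]

    def parse(self, response):
        loader = ScrapingTestingLoader(item=ScrapingTestingItem(), response=response)
        loader.add_xpath('reviews', '//div[@class="col2of2"]//p[@class="partial_entry"]/text()')
        loader.add_xpath('subjects', '//span[@class="noQuotes"]/text()')
        loader.add_xpath('names', '//*[@class="username mo"]/span/text()')
        yield loader.load_item()

Keep in mind that TakeFirst() keeps only the first matched value per field, so if you want all ten reviews from a page rather than one, leave that field's output as a list (or use a Join() processor) and rely only on the stripping input processor.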