lor*_*771 3 html python scrapy web-scraping python-2.7
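I created a web scraper with Scrapy that is able to scrape elements from each ticket on this website, but it cannot scrape the ticket price because the price is not on that page. When I try to request the next page to scrape the price, I get the error: exceptions.TypeError: 'XPathItemLoader' object has no attribute '__getitem__'. Item loaders are the only way I have managed to scrape any of the elements, so that is what I am currently using, and I am not sure of the correct procedure for passing an element scraped from another page to the item loader (I have seen a way to do it with the item data type, but it does not apply here). I think I might be having trouble extracting the elements into an item object because I am pipelining into a database, but I am not sure. If the code posted below can be modified so that it properly crawls to the next page, scrapes the price, and adds it to the item loader, I think that should solve the problem. Any help would be appreciated. Thanks!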
class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'

    def parse_price(self, response):
        #First attempt at trying to load price into item loader
        loader.add_xpath('ticketPrice' , '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price')
        print 'ticket price'

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/*[@class="productionsEvent"]/text()')
            loader.add_xpath('eventLocation' , './/*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()')
            loader.add_xpath('ticketsLink' , './/*/td[3]/a/@href')
            loader.add_xpath('eventDate' , './/*[@class = "productionsDate"]/text()')
            loader.add_xpath('eventCity' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()')
            loader.add_xpath('eventState' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()')
            loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')
            ticketsURL = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader["ticketsLink"]
            request = scrapy.Request(ticketsURL , callback = self.parse_price)
            yield loader.load_item()
Key issues to fix:
To get a value out of an item loader, use get_output_value(). Replace:
loader["ticketsLink"]
with:
loader.get_output_value("ticketsLink")
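You need to pass the loader inside the request's meta and yield/return the loaded item there, in the callback.
When constructing the URL for the price page, use urljoin() to join the relative part with the current URL.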
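The first point can be reproduced without Scrapy. In the minimal sketch below, FakeLoader is a hypothetical stand-in, not Scrapy's XPathItemLoader: it only mimics the fact that a loader exposes its collected values through a method rather than through [] indexing. Subscripting an object that defines no __getitem__ raises TypeError, which is the error from the question.

```python
class FakeLoader(object):
    """Hypothetical stand-in for an item loader (NOT Scrapy's class)."""

    def __init__(self):
        self._values = {}

    def add_value(self, field, value):
        # Collect values per field, roughly as a loader would.
        self._values.setdefault(field, []).append(value)

    def get_output_value(self, field):
        # Mimics a Join()-style output processor: values joined by a space.
        return " ".join(self._values.get(field, []))


loader = FakeLoader()
loader.add_value("ticketsLink", "12345.html")  # made-up value

try:
    loader["ticketsLink"]  # no __getitem__ defined, so this raises
    subscript_failed = False
except TypeError:
    subscript_failed = True

# The method call is what works on a loader-like object.
link = loader.get_output_value("ticketsLink")
```

The exact wording of the TypeError message differs between Python 2 and Python 3, but the cause is the same in both.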
Here is the fixed version:
from urlparse import urljoin
# other imports


class MySpider(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.vividseats.com"]
    start_urls = [vs_url]
    tickets_list_xpath = './/*[@itemtype="http://schema.org/Event"]'

    def parse_price(self, response):
        loader = response.meta['loader']
        loader.add_xpath('ticketPrice' , '//*[@class="eventTickets lastChild"]/div/div/@data-origin-price')
        return loader.load_item()

    def parse(self, response):
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/*[@class="productionsEvent"]/text()')
            loader.add_xpath('eventLocation' , './/*[@class = "productionsVenue"]/span[@itemprop = "name"]/text()')
            loader.add_xpath('ticketsLink' , './/*/td[3]/a/@href')
            loader.add_xpath('eventDate' , './/*[@class = "productionsDate"]/text()')
            loader.add_xpath('eventCity' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressLocality"]/text()')
            loader.add_xpath('eventState' , './/*[@class = "productionsVenue"]/span[@itemprop = "address"]/span[@itemprop = "addressRegion"]/text()')
            loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')
            ticketsURL = "concerts/" + bandname + "-tickets/" + bandname + "-" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price)
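
To see what the urljoin() step does on its own, here is a small standalone sketch. The URLs are made up for illustration (they are not real vividseats.com paths), and the try/except import is only an assumption-free convenience so the snippet also runs on Python 3, where the function lives in urllib.parse.

```python
try:
    from urlparse import urljoin       # Python 2, as in the answer
except ImportError:
    from urllib.parse import urljoin   # Python 3 location of the same function

# Hypothetical URL of the page currently being parsed (response.url).
current_url = "http://www.example.com/concerts/listing.html"

# A relative path without a leading slash is resolved against the
# directory of the current URL.
relative = urljoin(current_url, "band-tickets/band-123.html")
# relative == "http://www.example.com/concerts/band-tickets/band-123.html"

# A path with a leading slash is resolved against the site root.
rooted = urljoin(current_url, "/concerts/band-tickets/band-123.html")
# rooted == "http://www.example.com/concerts/band-tickets/band-123.html"

# An already absolute URL is returned unchanged.
absolute = urljoin(current_url, "http://other.example.com/x.html")
# absolute == "http://other.example.com/x.html"
```

Because a path without a leading slash is resolved relative to the current page's directory, it is worth checking that the "concerts/..." prefix built in parse() lands where expected for the start URL you are crawling.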