Following hyperlinks and "Filtered offsite request"

Mac*_*ace 10 python callback scrapy web-scraping

I know there are several related threads out there, and they have helped me a lot, but I still can't get all the way. I'm at the point where running the code doesn't produce errors, yet nothing ends up in my csv file. I have the following Scrapy spider, which starts on one webpage, then follows a hyperlink and scrapes the linked page:

from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class bbrItem(Item):
    Year = Field()
    AppraisalDate = Field()
    PropertyValue = Field()
    LandValue = Field()
    Usage = Field()
    LandSize = Field()
    Address = Field()    

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    allowed_domains = ["http://boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']

    def parse2(self, response):        
        hxs = HtmlXPathSelector(response)
        bbrs2 = hxs.select("id('evaluationControl')/div[2]/div")
        bbrs = iter(bbrs2)
        next(bbrs)
        for bbr in bbrs:
            item = bbrItem()
            item['Year'] = bbr.select("table/tbody/tr[1]/td[2]/text()").extract()
            item['AppraisalDate'] = bbr.select("table/tbody/tr[2]/td[2]/text()").extract()
            item['PropertyValue'] = bbr.select("table/tbody/tr[3]/td[2]/text()").extract()
            item['LandValue'] = bbr.select("table/tbody/tr[4]/td[2]/text()").extract()
            item['Usage'] = bbr.select("table/tbody/tr[5]/td[2]/text()").extract()
            item['LandSize'] = bbr.select("table/tbody/tr[6]/td[2]/text()").extract()
            item['Address']  = response.meta['address']
            yield item

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        PartUrl = ''.join(hxs.select("id('searchresult')/tr/td[1]/a/@href").extract())
        url2 = ''.join(["http://www.boliga.dk", PartUrl])
        yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2)

I'm trying to export the results to a csv file, but I get no content in the file, even though running the code doesn't produce any errors. I know this is a simple example with only one URL, but it illustrates my problem.

I think my problem may be that I'm not telling Scrapy that I want to save the data in the parse2 method.

By the way, I run the spider as `scrapy crawl spiderBBR -o scraped_data.csv -t csv`.

Tal*_*lin 28

You need to modify the Request you yield in parse so that it uses parse2 as its callback.
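
For illustration, a minimal sketch of that callback chaining (ChainSpider and the example.com URLs are placeholders, not part of the question's code):

from scrapy.http import Request
from scrapy.spider import BaseSpider

class ChainSpider(BaseSpider):
    name = 'chain'
    allowed_domains = ["example.com"]
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # first callback: extract the link, then schedule a second request
        # whose response will be handled by parse2
        yield Request(url='http://www.example.com/detail', callback=self.parse2)

    def parse2(self, response):
        # second callback: this is where items should be yielded
        pass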

Edit: allowed_domains should not include the http prefix, e.g.:

allowed_domains = ["boliga.dk"]

Try that and see if your spider runs correctly, or alternatively leave allowed_domains empty.
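
Putting the two suggestions together, the top of the spider would look roughly like this (a sketch based on the question's code, not a tested drop-in):

from scrapy.spider import BaseSpider

class spiderBBRTest(BaseSpider):
    name = 'spiderBBRTest'
    # bare domain only: with the "http://" prefix the offsite middleware
    # drops every request as a "Filtered offsite request"
    allowed_domains = ["boliga.dk"]
    start_urls = ['http://www.boliga.dk/bbr/resultater?sort=hus_nr_sort-a,etage-a,side-a&gade=Septembervej&hus_nr=29&ipostnr=2730']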

  • Leaving `allowed_domains` blank or removing the `http` prefix solves the problem. The other issues were just typos and unrelated to the subject of the question. Thanks for your answer! (4 upvotes)

小智 9

Try adding `dont_filter=True` to the Request:

yield Request(url=url2, meta={'address': hxs.select("id('searchresult')/tr/td[1]/a[@href]/text()").extract()}, callback=self.parse2, dont_filter=True)
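
Note that `dont_filter=True` tells Scrapy to skip its request filters for that request, which (in the Scrapy versions this question targets) covers both the duplicate filter and the offsite check. It therefore works around the `allowed_domains` typo rather than fixing it; correcting `allowed_domains` as in the accepted answer is the cleaner solution.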