After banging my head against this a few times, I've finally ended up here.

Problem: I'm trying to download the content of each Craigslist posting. By content I mean the "posting body", e.g. the description of a cell phone. Looking for old phones rather than new ones, since the iPhone took all the excitement away.

The code is the wonderful work of Michael Herman.

My spider class:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craig.items import CraiglistSampleItem

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        items = []
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
And the Item class:
from scrapy.item import Item, Field

class CraiglistSampleItem(Item):
    title = Field()
    link = Field()
Since the code will traverse many links, I'd like to save each phone's description in a separate csv, though an extra column in a single csv would also be fine.

Any leads!
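For reference, here's a minimal sketch (independent of Scrapy, using only the stdlib `csv` module, with made-up sample items) of what "description as another column" looks like in a csv file:

```python
import csv
import io

# Hypothetical scraped items; in the real spider these would be
# CraiglistSampleItem instances carrying an extra "description" field.
items = [
    {"title": "iPhone 4", "link": "/moa/123.html", "description": "Lightly used."},
    {"title": "Nokia", "link": "/moa/456.html", "description": "Indestructible."},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "link", "description"])
writer.writeheader()          # title,link,description
writer.writerows(items)       # one row per phone, description in its own column
print(buf.getvalue())
```

Scrapy's built-in csv feed exporter does essentially this for you once the item has a `description` field.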
Instead of returning items in the parse_items method, you should return/yield scrapy Request instances in order to get the description from the item page; pass the link and title inside the Item, and the Item inside the meta dictionary:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field

class CraiglistSampleItem(Item):
    title = Field()
    link = Field()
    description = Field()

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://minneapolis.craigslist.org/moa/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=("index\d00\.html",),
                               restrict_xpaths=('//p[@class="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//span[@class='pl']")
        for title in titles:
            item = CraiglistSampleItem()
            item["title"] = title.select("a/text()").extract()[0]
            item["link"] = title.select("a/@href").extract()[0]

            url = "http://minneapolis.craigslist.org%s" % item["link"]
            yield Request(url=url, meta={'item': item},
                          callback=self.parse_item_page)

    def parse_item_page(self, response):
        hxs = HtmlXPathSelector(response)

        item = response.meta['item']
        item['description'] = hxs.select('//section[@id="postingbody"]/text()').extract()
        return item
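The key idea above — carrying the half-filled item through `meta` so a second callback can complete it — can be sketched without Scrapy at all. The `FakeRequest`/`FakeResponse` classes below are illustrative stand-ins, not real Scrapy API:

```python
# Illustrative stand-ins for Scrapy's Request/Response meta mechanics.
class FakeRequest:
    def __init__(self, url, meta, callback):
        self.url, self.meta, self.callback = url, meta, callback

class FakeResponse:
    def __init__(self, request, body):
        # Scrapy copies the request's meta dict onto the response.
        self.meta, self.body = request.meta, body

def parse_items(listing):
    # First callback: fill title/link, defer description to a second request.
    item = {"title": listing["title"], "link": listing["link"]}
    url = "http://minneapolis.craigslist.org%s" % item["link"]
    yield FakeRequest(url, meta={"item": item}, callback=parse_item_page)

def parse_item_page(response):
    # Second callback: recover the item from meta and complete it.
    item = response.meta["item"]
    item["description"] = response.body
    return item

request = next(parse_items({"title": "iPhone 4", "link": "/moa/123.html"}))
item = request.callback(FakeResponse(request, "Lightly used, no scratches."))
print(item["description"])  # -> Lightly used, no scratches.
```

The same dict object flows through both callbacks, which is why the description ends up on the item that already holds the title and link.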
Run it and you'll see an additional description column in the output csv file.

Hope that helps.