Posted by Mic*_*ael

Scrapy crawls the first page but does not follow links

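I can't figure out why Scrapy crawls the first page but doesn't follow the links to crawl the subsequent pages. It must have something to do with the rules. Any help is much appreciated. Thanks!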

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist_sample.items import CraigslistItem

class MySpider(CrawlSpider):
    name = "craig"
    allowed_domains = ["sfbay.craigslist.org"]
    start_urls = ["http://sfbay.craigslist.org/acc/"]   

    rules = (Rule (SgmlLinkExtractor(allow=("index100\.html", ),restrict_xpaths=('//p[@id="nextpage"]',))
    , callback="parse_items", follow= True),
    )   

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for titles in titles:
            item = CraigslistItem()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return(items)

spider = MySpider()
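
One thing worth checking in the rule above is the `allow` pattern: `index100\.html` can only ever match a single page. The sketch below uses only the standard `re` module to show, in simplified form, how a regex filter like `allow` treats candidate URLs. The `index200.html` and `index300.html` URLs and the looser `index\d+\.html` pattern are assumptions made for illustration; they are not confirmed against the actual site markup, and `restrict_xpaths` may also need checking separately.

```python
import re

# The pattern from the rule above, plus a looser, hypothetical alternative.
STRICT = re.compile(r"index100\.html")
LOOSE = re.compile(r"index\d+\.html")  # assumption: pages are named index<N>.html

# Hypothetical pagination URLs, for illustration only.
urls = [
    "http://sfbay.craigslist.org/acc/index100.html",
    "http://sfbay.craigslist.org/acc/index200.html",
    "http://sfbay.craigslist.org/acc/index300.html",
]

def matching(pattern, candidates):
    """Return the URLs that a regex-based allow filter would accept."""
    return [u for u in candidates if pattern.search(u)]

strict_hits = matching(STRICT, urls)  # only the index100.html URL survives
loose_hits = matching(LOOSE, urls)    # every index<N>.html URL survives

print(strict_hits)
print(loose_hits)
```

If the real "next page" links do follow this naming scheme, the strict pattern would explain why the crawl stops after one hop, since every later link is filtered out before `parse_items` is ever called.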

python scrapy

4 votes, 1 answer, 4006 views