How to create an href-based LinkExtractor rule in Scrapy

San*_*esh 5 python regex scrapy web-scraping

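I'm trying to create a simple crawler with Scrapy (scrapy.org). As in the example, item.php is allowed. How do I write a rule that allows URLs which always start with http://example.com/category/ and whose GET parameters include page with any number of digits, possibly alongside other parameters? The order of these parameters is random. How can I write such a rule?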

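A few valid values are:

http://example.com/category/?sort=a-z&page=1
http://example.com/category/?page=1&sort=a-z&cache=1
http://example.com/category/?page=1&sort=a-z#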

Here is the code:

# In current Scrapy releases these live in scrapy.spiders and
# scrapy.linkextractors (formerly scrapy.contrib.*).
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        # Extract links whose URL matches item.php and parse them with parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # A plain dict is used here: a bare scrapy.Item() declares no fields,
        # so assigning item['id'] on it raises KeyError.
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

ale*_*cxe 5

Test for http://example.com/category/ at the start of the string, and for a page parameter whose value contains one or more digits:

Rule(LinkExtractor(allow=(r'^http://example.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),
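To make the pieces of that pattern easier to read, here is the same regular expression written with re.VERBOSE so that each part carries a comment. This is only a reading aid; the helper name is_allowed is made up for illustration and is not part of Scrapy.

```python
import re

# Same pattern as in the rule above, spelled out with re.VERBOSE.
# In verbose mode, whitespace and "#" comments inside the pattern are ignored.
pattern = re.compile(r'''
    ^http://example.com/category/   # the URL must start with the category path
    \?                              # a literal "?" that opens the query string
    .*?                             # any other parameters, as few characters as possible
    (?=page=\d+)                    # lookahead: "page=" plus at least one digit comes next
''', re.VERBOSE)


def is_allowed(url):
    """Hypothetical helper: True if the URL passes the pattern."""
    return pattern.search(url) is not None


print(is_allowed('http://example.com/category/?sort=a-z&page=1'))
print(is_allowed('http://example.com/category/?sort=a-z'))
```

Because the lazy `.*?` can skip over whatever comes before `page=`, the position of the page parameter in the query string does not matter.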

Demo (using your example URLs):

>>> import re
>>> pattern = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')
>>> should_match = [
...     'http://example.com/category/?sort=a-z&page=1',
...     'http://example.com/category/?page=1&sort=a-z&cache=1',
...     'http://example.com/category/?page=1&sort=a-z#'
... ]
>>> for url in should_match:
...     print("Matches" if pattern.search(url) else "Doesn't match")
... 
Matches
Matches
Matches
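It can also help to check a few URLs that should not match. The sketch below compares the pattern from this answer with a stricter variant; the stricter pattern is my own assumption about what "page with only digits" should mean, not something taken from the question, so adjust it to your site.

```python
import re

# The pattern from the answer, unchanged.
original = re.compile(r'^http://example.com/category/\?.*?(?=page=\d+)')

# Hypothetical stricter variant (an assumption, not from the answer):
# "page" must be a whole parameter name and its value must be digits only.
strict = re.compile(r'^http://example\.com/category/\?(?:.*&)?page=\d+(?:[&#]|$)')

urls = [
    'http://example.com/category/?sort=a-z&page=1',   # valid
    'http://example.com/category/?sort=a-z',          # no page parameter
    'http://example.com/category/?mypage=1',          # different parameter name
    'http://example.com/category/?page=12abc',        # value is not purely digits
]

for url in urls:
    # Print the URL with the verdict of each pattern
    print(url, bool(original.search(url)), bool(strict.search(url)))
```

The original pattern accepts the last two URLs because `.*?` may consume the "my" prefix and the lookahead only requires one digit after `page=`. If such URLs never occur on the site, the simpler pattern is sufficient.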