Scrapy is filtering unique URLs as duplicates

Question

The URLs

http://www.extrastores.com/en-sa/products/mobiles/smartphones-99500240157?page=1
http://www.extrastores.com/en-sa/products/mobiles/smartphones-99500240157?page=2

are unique, but Scrapy filters them as duplicates instead of crawling them.

I am using a CrawlSpider with the following rules:

rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(allow=(r'\/mobiles\/smartphones\/[a-zA-Z0-9_.-]*',)), callback='parse_product'),
)

I don't understand this behavior. Can someone explain it? The same code was working last week. I am using Scrapy 1.3.0.

Answer 1

jav*_*ved 3

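Following @paul trmbrth's suggestion, I rechecked my code and the site being scraped. Scrapy was downloading the links and then filtering them as duplicates because they had already been downloaded. The actual problem was that the href attribute of the HTML 'a' tags had changed from a static link to a JavaScript function call: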

<a href='javascript:gtm.traceProductClick("/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024">
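
A quick way to see why the old rule stopped producing product links is to look at the scheme of the new href value. The sketch below uses only the standard library; the FOLLOWABLE set is an assumption about which schemes a link extractor would accept, not a quote from Scrapy's source.

```python
from urllib.parse import urlparse

# href value as shown in the tag above (the tag is truncated in the answer).
href = ('javascript:gtm.traceProductClick('
        '"/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024"')

# Assumed set of schemes a crawler would follow (hypothetical, for illustration).
FOLLOWABLE = {'http', 'https', 'file', 'ftp'}

scheme = urlparse(href).scheme
is_followable = scheme in FOLLOWABLE

print(scheme, is_followable)
```

Because the value is a javascript: URL rather than a path, it cannot be turned into a request as-is, so only the pagination links were left to be extracted and deduplicated.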

I changed my spider code accordingly:

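import re

def _process_value(value):
    # Pull the real path out of href='javascript:gtm.traceProductClick("...")'.
    # Returning None (no match) makes the LinkExtractor ignore that link.
    m = re.search(r'javascript:gtm\.traceProductClick\("(.*?)"', value)
    if m:
        return m.group(1)


rules = (
    Rule(LinkExtractor(restrict_css=('.resultspagenum'))),
    Rule(LinkExtractor(
        allow=(r'\/mobiles\/smartphones\/[a-zA-Z0-9_.-]*',),
        process_value=_process_value
    ), callback='parse_product'),
)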
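
The extraction step can be checked without running a crawl. The following is a minimal sketch using only the standard library: extract_path is a hypothetical stand-in for _process_value, the href is the one from the tag shown earlier, and the page URL is the first listing URL from the question.

```python
import re
from urllib.parse import urljoin

# Same pattern as in _process_value, compiled once.
CLICK_RE = re.compile(r'javascript:gtm\.traceProductClick\("(.*?)"')
# Same pattern as the allow= argument of the second rule.
ALLOW_RE = re.compile(r'\/mobiles\/smartphones\/[a-zA-Z0-9_.-]*')


def extract_path(value):
    """Return the path wrapped in the JavaScript call, or None if absent."""
    m = CLICK_RE.search(value)
    return m.group(1) if m else None


href = ('javascript:gtm.traceProductClick('
        '"/en-sa/mobiles/smartphones/samsung-galaxy-s7-32gb-dual-sim-lte-gold-188024"')
page = 'http://www.extrastores.com/en-sa/products/mobiles/smartphones-99500240157?page=1'

path = extract_path(href)
url = urljoin(page, path)           # resolve the path against the listing page
allowed = bool(ALLOW_RE.search(url))

print(url, allowed)
print(extract_path('/en-sa/some/static/link'))  # plain hrefs yield None
```

Note that a plain href yields None here, so with this helper the second rule only follows links wrapped in the JavaScript call.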

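So this was not a problem with Scrapy filtering unique URLs. It was a problem with extracting the link from the href attribute of the 'a' tag: the site had recently changed that attribute, which broke my code. Thanks again, @paul trmbrth.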
