I have been stuck on this for days and it is driving me crazy.
I call my Scrapy spider like this:
scrapy crawl example -a follow_links="True"
I pass in the "follow_links" flag to determine whether the entire website should be scraped, or only the index pages I have defined in the spider.
The flag is checked in the spider's constructor to decide which rules should be set:
def __init__(self, *args, **kwargs):
    super(ExampleSpider, self).__init__(*args, **kwargs)

    self.follow_links = kwargs.get('follow_links')

    if self.follow_links == "True":
        self.rules = (
            Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
        )
    else:
        self.rules = (
            Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
        )
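As an aside, spider arguments passed with -a always arrive as strings, so a comparison like self.follow_links == "True" silently falls through for inputs such as "true" or "1". A plain-Python sketch of a more forgiving conversion (the helper name str_to_bool is my own, not part of Scrapy):

```python
def str_to_bool(value, default=False):
    """Interpret a command-line string flag as a boolean."""
    if value is None:
        return default
    return str(value).strip().lower() in ("true", "1", "yes")

# scrapy crawl example -a follow_links="True" hands the spider the string "True"
print(str_to_bool("True"))   # True
print(str_to_bool("false"))  # False
print(str_to_bool(None))     # False
```

With this helper, the constructor could read self.follow_links = str_to_bool(kwargs.get('follow_links')) and branch on a real boolean.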
If it is "True", all links are allowed; if it is "False", all links are denied.
So far, so good. However, these rules are ignored. The only way I can get rules to be followed is to define them outside of the constructor. That means something like this works as expected:
class ExampleSpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
    )

    def __init__(self, *args, **kwargs):
        ...
So, basically, defining the rules inside the __init__ constructor causes them to be ignored, whereas defining them outside the constructor works as expected.
I don't understand this. My code is below.
import re
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from w3lib.html import remove_tags, remove_comments, replace_escape_chars, replace_entities, remove_tags_with_content


class ExampleSpider(CrawlSpider):
    name = "example"
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # if the rule below is uncommented, it works as expected (i.e. follow links and call parse_pages)
    # rules = (
    #     Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
    # )

    def __init__(self, *args, **kwargs):
        super(ExampleSpider, self).__init__(*args, **kwargs)

        # single page or follow links
        self.follow_links = kwargs.get('follow_links')

        if self.follow_links == "True":
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(allow=()), callback="parse_pages", follow=True),
            )
        else:
            # the rule below will always be ignored (why?!)
            self.rules = (
                Rule(LinkExtractor(deny=(r'[a-zA-Z0-9]*')), callback="parse_pages", follow=False),
            )

    def parse_pages(self, response):
        print("In parse_pages")
        print(response.xpath('/html/body').extract())
        return None

    def parse_start_url(self, response):
        print("In parse_start_url")
        print(response.xpath('/html/body').extract())
        return None
Thank you for taking the time to help me with this.
The problem here is that the CrawlSpider constructor (__init__) also processes the rules attribute, so if you need to assign your own rules, you have to do it before calling the default constructor.
In other words, do everything you need before calling super(ExampleSpider, self).__init__(*args, **kwargs):
def __init__(self, *args, **kwargs):
    # set self.rules here, before the parent constructor runs
    super(ExampleSpider, self).__init__(*args, **kwargs)
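To see why the order matters, here is a minimal plain-Python sketch (no Scrapy required; the class names BaseSpiderSketch, LateRules, EarlyRules and the _compiled attribute are illustrative): the base constructor snapshots whatever self.rules holds at call time, so anything assigned afterwards is never processed.

```python
class BaseSpiderSketch:
    """Mimics CrawlSpider: the constructor compiles whatever rules exist at call time."""
    rules = ()

    def __init__(self):
        # CrawlSpider does something similar in its own __init__
        self._compiled = list(self.rules)


class LateRules(BaseSpiderSketch):
    def __init__(self):
        super().__init__()          # compiles the (empty) class-level rules
        self.rules = ("my rule",)   # too late: never compiled


class EarlyRules(BaseSpiderSketch):
    def __init__(self):
        self.rules = ("my rule",)   # assigned first...
        super().__init__()          # ...so the base constructor sees it


print(LateRules()._compiled)   # [] -- the rule was ignored
print(EarlyRules()._compiled)  # ['my rule']
```

This is exactly the situation in the question: the class-level rules worked because they already existed when CrawlSpider's constructor ran, while the rules assigned after super().__init__ were ignored.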