I am trying to get SgmlLinkExtractor to work.
This is the signature:
SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)
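As I understand it, the entries in allow are regular expressions matched against the URLs the extractor pulls out of a response. Here is a minimal sketch of that (the extractor and helper function below are my illustration, not part of my spider):

# A minimal sketch of how I understand the allow parameter: each entry
# is a regex, and only links whose URL matches one of them survive.
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

extractor = SgmlLinkExtractor(allow=(r'/aadler/',))

def matching_urls(response):
    # extract_links() returns Link objects filtered by the allow patterns
    return [link.url for link in extractor.extract_links(response)]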
I am only using allow=().
So, I enter:
rules = (Rule(SgmlLinkExtractor(allow=("/aadler/", )), callback='parse'),)
So, the initial URL is 'http://www.whitecase.com/jacevedo/', I am passing in allow=('/aadler',), and I expect
'/aadler/' to be scanned as well. Instead, the spider scans the initial URL and then closes:
[wcase] INFO: Domain opened
[wcase] DEBUG: Crawled </jacevedo/> (referer: <None>)
[wcase] INFO: Passed NuItem(school=[u'JD, ', u'Columbia Law School, Harlan Fiske Stone Scholar, Parker School Recognition of Achievement in International and Foreign Law, ', u'2005'])
[wcase] INFO: Closing domain (finished)
What am I doing wrong here?
Is there anyone here who has used Scrapy successfully and can help me finish this spider?
Thank you for your help.
I have included the spider's code below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
…
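In case it helps, here is a stripped-down sketch of the overall shape I am going for (the class name, allowed_domains value, and parse_profile callback are illustrative, not my actual code; attribute spellings vary a bit across old Scrapy versions):

# A stripped-down CrawlSpider sketch with illustrative names.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class WcaseSpider(CrawlSpider):
    name = 'wcase'
    allowed_domains = ['whitecase.com']
    start_urls = ['http://www.whitecase.com/jacevedo/']

    rules = (
        # The Scrapy docs warn against naming a Rule callback 'parse',
        # because CrawlSpider defines its own parse() to apply the rules.
        Rule(SgmlLinkExtractor(allow=(r'/aadler/',)), callback='parse_profile'),
    )

    def parse_profile(self, response):
        # item extraction would go here
        pass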