Tags: python, bots, web-crawler, scrapy, scrapy-spider
I'm trying to write a generic "master" spider that I can use with start_urls and allowed_domains inserted dynamically at execution time. (Eventually, I will be storing these in a database, pulling them out, and using them to initialize and crawl a new spider for each database entry.)
At the moment I have two files:
To write these two files I referenced the following:
I considered scrapyd, but I don't think it is what I am looking for...
Here is what I have written:
MySpider.py:
import scrapy

class BlackSpider(scrapy.Spider):
    name = 'Black1'

    def __init__(self, allowed_domains=[], start_urls=[], *args, **kwargs):
        super(BlackSpider, self).__init__(*args, **kwargs)
        self.start_urls = start_urls
        self.allowed_domains = allowed_domains
        # For testing:
        print start_urls
        print self.start_urls
        print allowed_domains
        print self.allowed_domains

    def parse(self, response):
        #############################
        # Insert my parse code here #
        #############################
        return items
RunSpider.py:
import scrapy
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

# Set my allowed domain (this will come from DB later)
ad = ["example.com"]
# Set my start url
sd = ["http://example.com/files/subfile/dir1"]

# Initialize MySpider with the above allowed domain and start url
MySpider = BlackSpider(ad, sd)

# Crawl MySpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
The problem:
Here is my issue - when I execute this, it appears to pass my arguments to allowed_domains and start_urls successfully. However, once MySpider is initialized and I run the spider to crawl, the specified url/domain is no longer found and no website is crawled. I added the print statements above to show this:
me@mybox:~/$ python RunSpider.py
['http://example.com/files/subfile/dir1']
['http://example.com/files/subfile/dir1']
['example.com']
['example.com']
2016-02-26 16:11:41 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
...
2016-02-26 16:11:41 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
...
[]
[]
[]
[]
2016-02-26 16:11:41 [scrapy] INFO: Spider opened
...
2016-02-26 16:11:41 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
2016-02-26 16:11:41 [scrapy] INFO: Closing spider (finished)
...
2016-02-26 16:11:41 [scrapy] INFO: Spider closed (finished)
Why does my spider initialize correctly, but when I try to execute it the URLs are missing? Is this a basic Python programming (classes?) mistake that I am just missing?
CrawlerProcess.crawl() expects a Crawler or a scrapy.Spider subclass, not a spider instance; crawl() instantiates the spider itself, which is why your second set of print statements shows the empty defaults from a fresh, argument-less __init__ call. So you need to do something like this:
import scrapy
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

# Set my allowed domain (this will come from DB later)
ad = ["example.com"]
# Set my start url
sd = ["http://example.com/files/subfile/dir1"]

# Crawl BlackSpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
# Pass the Spider class, and the other params as keyword arguments
process.crawl(BlackSpider, allowed_domains=ad, start_urls=sd)
process.start()
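As a side note, once the class (rather than an instance) is passed to crawl(), the custom __init__ in BlackSpider is arguably optional: the stock scrapy.Spider.__init__ copies keyword arguments onto the instance, which is the same mechanism the -a command-line arguments use. A minimal sketch of the spider under that assumption:

import scrapy

class BlackSpider(scrapy.Spider):
    name = 'Black1'
    # No custom __init__ needed here: the base scrapy.Spider.__init__
    # copies the keyword arguments passed to crawl() (allowed_domains,
    # start_urls) onto the instance as attributes.

    def parse(self, response):
        self.logger.info("Parsing %s", response.url)
        #############################
        # Insert my parse code here #
        #############################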
You can see this done by the scrapy commands themselves, for example scrapy runspider:
def run(self, args, opts):
    ...
    spidercls = spclasses.pop()
    self.crawler_process.crawl(spidercls, **opts.spargs)
    self.crawler_process.start()
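For the eventual goal of crawling one spider per database entry: CrawlerProcess lets you queue several crawls before calling start(), so a driver script can loop over the rows and schedule one BlackSpider per entry. A rough sketch, assuming a SQLite file targets.db with a table targets(domain, url) (both the file name and the schema are made up for illustration):

import sqlite3
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Hypothetical schema: targets(domain TEXT, url TEXT)
conn = sqlite3.connect('targets.db')
for domain, url in conn.execute('SELECT domain, url FROM targets'):
    # Queue one crawl per row; each crawl gets its own spider instance
    process.crawl(BlackSpider, allowed_domains=[domain], start_urls=[url])
conn.close()

process.start()  # runs all queued crawls and blocks until they finish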