Tags: twisted, scrapy, web-scraping, python-3.x
I want to create a scheduler script that runs the same spider multiple times in sequence.
So far I have the following:
#!/usr/bin/python3
"""Scheduler for spiders."""
import time

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """Job to start spiders."""
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(DealsSpider)
    process.start()  # the script will block here until the end of the crawl


if __name__ == '__main__':
    while True:
        crawl_job()
        time.sleep(30)  # wait 30 seconds then crawl again
Currently the spider executes fine the first time; then, after the delay, the spider starts up again, but just before it would begin scraping I get the following error message:
Traceback (most recent call last):
  File "scheduler.py", line 27, in <module>
    crawl_job()
  File "scheduler.py", line 17, in crawl_job
    process.start()  # the script will block here until the end of the crawl
  File "/usr/local/lib/python3.5/dist-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "/usr/local/lib/python3.5/dist-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
Unfortunately, I'm not familiar with the Twisted framework and its reactor, so any help would be appreciated!
You're getting the ReactorNotRestartable error because in Twisted the reactor cannot be started more than once per process. Basically, each time process.start() is called, it tries to start the reactor again. There's plenty of information about this on the web.
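You can see the restriction with a minimal sketch that has nothing to do with Scrapy (this is just an illustration of the Twisted behavior, not part of the fix):

from twisted.internet import reactor, error

reactor.callLater(0, reactor.stop)  # stop the reactor as soon as it starts
reactor.run()  # first run: starts, then stops immediately

try:
    reactor.run()  # second run: a reactor cannot be restarted
except error.ReactorNotRestartable:
    print("the reactor can only be started once per process")

Here's a simple solution: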
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

from my_project.spiders.deals import DealsSpider


def crawl_job():
    """
    Job to start spiders.
    Return Deferred, which will execute after crawl has completed.
    """
    settings = get_project_settings()
    runner = CrawlerRunner(settings)
    return runner.crawl(DealsSpider)


def schedule_next_crawl(null, sleep_time):
    """
    Schedule the next crawl
    """
    reactor.callLater(sleep_time, crawl)


def crawl():
    """
    A "recursive" function that schedules a crawl 30 seconds after
    each successful crawl.
    """
    # crawl_job() returns a Deferred
    d = crawl_job()
    # call schedule_next_crawl(<scrapy response>, n) after crawl job is complete
    d.addCallback(schedule_next_crawl, 30)
    d.addErrback(catch_error)


def catch_error(failure):
    print(failure.value)


if __name__ == "__main__":
    crawl()
    reactor.run()  # start the reactor; it runs until stopped
There are a few notable differences from your snippet. The reactor is invoked directly, CrawlerProcess is replaced with CrawlerRunner, time.sleep has been removed so that the reactor doesn't block, and the while loop is replaced with recurring calls to crawl via callLater. It's short and should do what you want. If any part confuses you, let me know and I'll elaborate.

As a variation, if you want to crawl at a specific wall-clock time each day rather than on a fixed interval, you can compute the number of seconds until that time and hand it to callLater:
import datetime as dt


def schedule_next_crawl(null, hour, minute):
    """Schedule the next crawl for the given time tomorrow."""
    tomorrow = (
        dt.datetime.now() + dt.timedelta(days=1)
    ).replace(hour=hour, minute=minute, second=0, microsecond=0)
    sleep_time = (tomorrow - dt.datetime.now()).total_seconds()
    reactor.callLater(sleep_time, crawl)


def crawl():
    d = crawl_job()
    # crawl every day at 1:30pm
    d.addCallback(schedule_next_crawl, hour=13, minute=30)
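One caveat worth knowing: unlike CrawlerProcess, CrawlerRunner doesn't set up logging on its own, so if you want Scrapy's usual log output you should configure it yourself before starting the reactor. A minimal sketch using Scrapy's configure_logging helper:

from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

# apply the project's LOG_* settings before calling reactor.run()
configure_logging(get_project_settings())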