Python Scrapy: alternative to avoid ReactorNotRestartable

sag*_*gar 8 python reactor scrapy flask twisted.internet

I have been trying to build a Scrapy application in Python with the following functionality:

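  • A REST API (which I built with Flask) listens for all crawl/scrape requests and returns the response once the crawl has finished. (The crawl is short enough that the connection can be kept alive until it completes.)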

I am able to do this with the following code:

items = []
def add_item(item):
    items.append(item)

# set up crawler
crawler = Crawler(SpiderClass,settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)

# This is added to stop the reactor; without it, the code gets stuck at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) #@UndefinedVariable 
crawler.crawl(requestParams=requestParams)
# start crawling 
reactor.run() #@UndefinedVariable
return str(items)

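The problem I am facing now is that after stopping the reactor (which seems necessary to me, since I don't want to stay blocked in reactor.run()), I cannot accept any further requests. Once the first request has completed, the next one fails with the following error: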

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable

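This is to be expected, since a Twisted reactor cannot be restarted once it has been stopped.

So my questions are:

1) How can I serve the next crawl request?

2) Is there a way to move on to the line after reactor.run() without stopping the reactor?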

小智 1

Here is a simple solution to your problem:

from flask import Flask
import threading
import subprocess
import sys

app = Flask(__name__)


class myThread(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        # Stored only as a label; run() always calls start_crawl().
        self.target = target

    def run(self):
        start_crawl()


def start_crawl():
    # Launch the crawler in a separate Python process. Each process gets
    # its own reactor, so nothing has to be restarted in the Flask process.
    subprocess.Popen([sys.executable, "start_request.py"])


@app.route("/crawler/start")
def start_req():
    print(":request")
    threadObj = myThread("run_crawler")
    threadObj.start()
    return "Your crawler is in running state"


if __name__ == "__main__":
    app.run(port=5000)

In the solution above, I assume that you can start the crawler from the shell/command line by running the start_request.py file.
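The question also asks for the scraped items to be returned in the HTTP response, which a fire-and-forget Popen call does not provide. One possible variation, sketched here under assumptions rather than taken from the answer, is to wait for the child process and read its output. The sketch assumes Python 3 and assumes that the crawl script prints its items as JSON on stdout; `run_crawl` and `FAKE_CRAWL_SCRIPT` are hypothetical names, and the one-line stand-in script replaces start_request.py so the example runs without Scrapy.

```python
import json
import subprocess
import sys

# Stand-in for start_request.py (assumption): a real script would run the
# Scrapy crawl and print the scraped items as JSON on stdout.
FAKE_CRAWL_SCRIPT = 'import json; print(json.dumps([{"title": "example"}]))'


def run_crawl(command, timeout=60):
    """Run one crawl in a fresh process and return its parsed JSON output."""
    # A new interpreter starts with a new reactor, so the Flask process
    # never has to restart one. check_output blocks until the child exits.
    output = subprocess.check_output(command, timeout=timeout)
    return json.loads(output.decode("utf-8"))


if __name__ == "__main__":
    items = run_crawl([sys.executable, "-c", FAKE_CRAWL_SCRIPT])
    print(items)
```

Inside a Flask view, the returned list could be serialized into the response body. Because the call blocks until the child exits, this only suits crawls that are short, as the question describes.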

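What we are doing here is using Python threading to start a new thread for each incoming request. You can now easily run crawler instances in parallel, one per hit. Just use threading.activeCount() to keep the number of threads under control.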
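Since subprocess.Popen returns immediately, the threads in the code above finish almost at once, so the thread count may not reflect how many crawls are still running. A minimal sketch of an alternative, assuming you would rather cap the number of live crawler processes, is to keep the Popen handles and count those that have not yet exited. `MAX_CRAWLS` and `try_start_crawl` are hypothetical names introduced for illustration.

```python
import subprocess
import sys
import threading

MAX_CRAWLS = 2  # hypothetical limit on simultaneous crawler processes

_running = []             # Popen handles of crawls that were started
_lock = threading.Lock()  # Flask may serve requests from several threads


def try_start_crawl(command):
    """Start the command unless MAX_CRAWLS children are still running.

    Returns the Popen handle, or None if the limit has been reached.
    """
    with _lock:
        # poll() returns None while a child is alive; drop finished ones.
        _running[:] = [p for p in _running if p.poll() is None]
        if len(_running) >= MAX_CRAWLS:
            return None
        proc = subprocess.Popen(command)
        _running.append(proc)
        return proc


if __name__ == "__main__":
    # A short sleep stands in for a real crawl script.
    cmd = [sys.executable, "-c", "import time; time.sleep(1)"]
    started = [try_start_crawl(cmd) for _ in range(3)]
    print([p is not None for p in started])
    for p in started:
        if p is not None:
            p.wait()
```

A view function could call `try_start_crawl([sys.executable, "start_request.py"])` and return an error response when it gets None back.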