我正在使用scrapy来获取数据,我想使用flask web框架在网页中显示结果.但我不知道如何在烧瓶应用程序中调用蜘蛛.我试图用来CrawlerProcess调用我的蜘蛛,但是我得到了这样的错误:
ValueError
ValueError: signal only works in main thread
Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
rv = self.dispatch_request() …Run Code Online (Sandbox Code Playgroud) API应该允许包含用户想要抓取的URL的任意HTTP get请求,然后Flask应该返回scrape的结果.
以下代码适用于第一个http请求,但在反应器停止后,它将不会重新启动.我甚至可能不会以正确的方式解决这个问题,但我只是想在Heroku上放置一个RESTful scrapy API,到目前为止,我所能想到的就是它.
有没有更好的方法来构建这个解决方案?或者如何scrape_it在不停止扭曲的反应堆(不能再次启动)的情况下允许返回?
from flask import Flask
import os
import sys
import json
from n_grams.spiders.n_gram_spider import NGramsSpider
# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
app = Flask(__name__)
def scrape_it(url):
items = []
def add_item(item):
items.append(item)
runner = CrawlerRunner()
d = runner.crawl(NGramsSpider, [url])
d.addBoth(lambda _: reactor.stop()) # <<< TROUBLES HERE ???
dispatcher.connect(add_item, signal=signals.item_passed)
reactor.run(installSignalHandlers=0) # the script will block here until the …Run Code Online (Sandbox Code Playgroud) 我有两个测试类(TrialTest1和TrialTest2)写在两个文件(test_trial1.py和test_trial2.py)中大多数相同(唯一的区别是类名):
from twisted.internet import reactor
from twisted.trial import unittest
class TrialTest1(unittest.TestCase):
def setUp(self):
print("setUp()")
def test_main(self):
print("test_main")
reactor.callLater(1, self._called_by_deffered1)
reactor.run()
def _called_by_deffered1(self):
print("_called_by_deffered1")
reactor.callLater(1, self._called_by_deffered2)
def _called_by_deffered2(self):
print("_called_by_deffered2")
reactor.stop()
def tearDown(self):
print("tearDown()")
Run Code Online (Sandbox Code Playgroud)
当我完全运行每个测试时,一切都很好.但是当我启动它时,我有以下输出:
setUp()
test_main
_called_by_deffered1
_called_by_deffered2
tearDown()
setUp()
test_main
tearDown()
Error
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/twisted/internet/defer.py", line 137, in maybeDeferred
result = f(*args, **kw)
File "/usr/lib/python2.7/site-packages/twisted/internet/utils.py", line 203, in runWithWarningsSuppressed
reraise(exc_info[1], exc_info[2])
File "/usr/lib/python2.7/site-packages/twisted/internet/utils.py", line 199, in runWithWarningsSuppressed
result …Run Code Online (Sandbox Code Playgroud)