dat*_*den 14 python web-crawler scrapy
I have:
from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
I have always run this process successfully:
process = CrawlerProcess(get_project_settings())
process.crawl(*args)
# the script will block here until the crawling is finished
process.start()
But since I moved this code into a web_crawler(self) function, like so:
def web_crawler(self):
    # set up a crawler
    process = CrawlerProcess(get_project_settings())
    process.crawl(*args)
    # the script will block here until the crawling is finished
    process.start()
    # (...)
    return (result1, result2)
and started calling the method through a class instance, like:
def __call__(self):
    results1 = test.web_crawler()[1]
    results2 = test.web_crawler()[0]
and running:
test()
I get the following error:
Traceback (most recent call last):
  File "test.py", line 573, in <module>
    print (test())
  File "test.py", line 530, in __call__
    artists = test.web_crawler()
  File "test.py", line 438, in web_crawler
    process.start()
  File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 280, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1194, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1174, in startRunning
    ReactorBase.startRunning(self)
  File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
What is wrong?
Fer*_*ard 23
You cannot restart the reactor, but you should be able to run it multiple times by forking a separate process for each run:
import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())

# the wrapper to make it run more times
def run_spider(spider):
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(spider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result
Run it twice:
print('first run:')
run_spider(QuotesSpider)
print('\nsecond run:')
run_spider(QuotesSpider)
Result:
first run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
second run:
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“A day without sunshine is like, you know, night.”
...
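The failure in the question can be reproduced without Scrapy. The sketch below uses a hypothetical `OneShotLoop` class as a stand-in for the Twisted reactor (the class and `LoopNotRestartable` are illustrative names, not Twisted's API): once it has run, it refuses to run again in the same process, which is what happens when `web_crawler()` is called a second time. Calling it once and unpacking both results avoids the second start.

```python
class LoopNotRestartable(Exception):
    """Stand-in for twisted.internet.error.ReactorNotRestartable."""


class OneShotLoop:
    """Hypothetical stand-in for the reactor: one run per process."""

    def __init__(self):
        self._has_run = False

    def run(self):
        # Refuse a second run, as the real reactor does after it stops.
        if self._has_run:
            raise LoopNotRestartable()
        self._has_run = True


# One shared loop per process, like the global `reactor` object.
loop = OneShotLoop()


def web_crawler():
    """Mimics the question's method: every call tries to start the loop."""
    loop.run()
    return ("result1", "result2")


# Call once and unpack, instead of calling web_crawler() once per result.
result1, result2 = web_crawler()

try:
    web_crawler()  # a second start in the same process
except LoopNotRestartable:
    second_call_failed = True
else:
    second_call_failed = False

print(result1, result2, second_call_failed)
```

If a single crawl per program run is enough, the unpacking fix alone is sufficient; the separate-process wrapper above is only needed when the crawl really has to run more than once.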
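This is what helped me win the battle against the ReactorNotRestartable error: the last answer from the author of the question
0) pip install crochet
1) import it: from crochet import setup
2) call setup() at the top of the file
3) remove 2 lines:
a) d.addBoth(lambda _: reactor.stop())
b) reactor.run()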
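As far as I understand it, the steps above work because crochet's `setup()` starts the reactor once in a background thread and leaves it running, so nothing ever has to restart it. The same idea can be sketched with the standard library only, using asyncio instead of Twisted. Everything here (`fake_crawl`, `run_blocking`, the timeout value) is a hypothetical illustration, not crochet's or Scrapy's API.

```python
import asyncio
import threading

# Start one event loop in a daemon thread. It keeps running for the
# lifetime of the program and is never stopped or restarted.
_loop = asyncio.new_event_loop()
threading.Thread(target=_loop.run_forever, daemon=True).start()


async def fake_crawl(name):
    """Hypothetical stand-in for a crawl."""
    await asyncio.sleep(0.01)
    return "crawled " + name


def run_blocking(coro, timeout=5.0):
    """Submit a coroutine to the background loop and wait for its result."""
    future = asyncio.run_coroutine_threadsafe(coro, _loop)
    return future.result(timeout=timeout)


# The same job can be submitted any number of times.
first = run_blocking(fake_crawl("quotes"))
second = run_blocking(fake_crawl("quotes"))
print(first, second)
```

Note that with crochet a call such as `crawler.crawl(...)` returns right away while the crawl continues in the background; crochet provides decorators such as `wait_for` for blocking until a result is ready, so check its documentation if the caller needs to wait.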
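I had the same problem with this error and spent more than 4 hours on it, reading every question about it here. I finally found that answer, and I am sharing it. This is how I solved the problem. The only meaningful lines left from the Scrapy docs are the last two lines of my code: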
# some more imports
from importlib import import_module
from crochet import setup
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

setup()

def run_spider(spiderName):
    module_name = "first_scrapy.spiders.{}".format(spiderName)
    scrapy_var = import_module(module_name)  # dynamically import the selected spider module
    spiderCls = scrapy_var.mySpider  # the spider class; newer Scrapy versions reject spider instances
    crawler = CrawlerRunner(get_project_settings())  # from Scrapy docs
    crawler.crawl(spiderCls)  # from Scrapy docs
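This code lets me choose which spider to run just by passing its name to the run_spider function and, once scraping has finished, choose another spider and run it again.
I hope this helps somebody, as it helped me :)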
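The name-based selection described above comes down to a dynamic import followed by an attribute lookup. A minimal sketch of just that step, assuming a hypothetical helper `load_spider_class` and using standard-library modules in place of a real spiders package so that it runs anywhere:

```python
from importlib import import_module


def load_spider_class(package, spider_name, class_name="mySpider"):
    """Import `package.spider_name` and return the class named `class_name`."""
    module = import_module("{}.{}".format(package, spider_name))
    return getattr(module, class_name)


# Stand-in for load_spider_class("first_scrapy.spiders", "some_spider"):
# the stdlib module json.decoder plays the role of a spider module.
decoder_cls = load_spider_class("json", "decoder", class_name="JSONDecoder")
print(decoder_cls.__name__)
```

A misspelled name raises `ModuleNotFoundError` or `AttributeError`, so a caller taking names from user input may want to catch those two exceptions.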