Asked by Ahs*_*lam (tags: python, twisted, web-crawler, scrapy, scrapy-spider)
I am running a spider in a Scrapy project from a script file, and the spider logs the crawled output/results. But I want to use the spider's output/results inside a function in that script file, without saving the output/results to any file or database. Here is the script, taken from https://doc.scrapy.org/en/latest/topics/practices.html#run-from-script:
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    from scrapy.utils.project import get_project_settings

    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner(get_project_settings())

    d = runner.crawl('my_spider')
    d.addBoth(lambda _: reactor.stop())
    reactor.run()

    def spider_output(output):
        # do something with that output
        pass
How can I get the spider's output into the spider_output function? Is it possible to get the output/results this way?
Answered by Ahs*_*lam
Here is a solution that collects all of the output/results in a list:
    from scrapy import signals
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from scrapy.signalmanager import dispatcher

    def spider_results():
        results = []

        def crawler_results(signal, sender, item, response, spider):
            # called once for every item the spider yields
            results.append(item)

        # item_scraped fires for each item that has passed the item pipelines
        # (older Scrapy versions exposed the same signal under the alias item_passed)
        dispatcher.connect(crawler_results, signal=signals.item_scraped)

        process = CrawlerProcess(get_project_settings())
        process.crawl(MySpider)  # MySpider is your spider class, imported from your project
        process.start()  # the script will block here until the crawling is finished
        return results

    if __name__ == '__main__':
        print(spider_results())
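The same signal trick also fits the CrawlerRunner script from the question, if you would rather keep the explicit Twisted reactor. Below is a minimal sketch, not part of the original answer, that assumes a spider named 'my_spider' exists in the project; the collected items are handed to spider_output once the crawl's Deferred fires:

    from twisted.internet import reactor
    from scrapy import signals
    from scrapy.crawler import CrawlerRunner
    from scrapy.signalmanager import dispatcher
    from scrapy.utils.project import get_project_settings

    results = []

    def crawler_results(signal, sender, item, response, spider):
        # collect every scraped item into the shared list
        results.append(item)

    def spider_output(output):
        # do something with the scraped items here
        print(output)

    dispatcher.connect(crawler_results, signal=signals.item_scraped)

    runner = CrawlerRunner(get_project_settings())
    d = runner.crawl('my_spider')
    # the Deferred fires when the crawl finishes; pass the items on
    d.addCallback(lambda _: spider_output(results))
    d.addBoth(lambda _: reactor.stop())
    reactor.run()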
This is an old question, but for future reference: if you are using Python 3.6+, I recommend scrapyscript, which lets you run spiders and collect their results in a very simple way:
    import json

    from scrapy import Request
    from scrapy.spiders import Spider
    from scrapyscript import Job, Processor

    # Define a Scrapy Spider, which can accept *args or **kwargs
    # https://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
    class PythonSpider(Spider):
        name = 'myspider'

        def start_requests(self):
            yield Request(self.url)

        def parse(self, response):
            title = response.xpath('//title/text()').extract()
            return {'url': response.request.url, 'title': title}

    # Create jobs for each instance. *args and **kwargs supplied here will
    # be passed to the spider constructor at runtime
    githubJob = Job(PythonSpider, url='http://www.github.com')
    pythonJob = Job(PythonSpider, url='http://www.python.org')

    # Create a Processor, optionally passing in a Scrapy Settings object.
    processor = Processor(settings=None)

    # Start the reactor, and block until all spiders complete.
    data = processor.run([githubJob, pythonJob])

    # Print the consolidated results
    print(json.dumps(data, indent=4))
Output:

    [
        {
            "title": [
                "Welcome to Python.org"
            ],
            "url": "https://www.python.org/"
        },
        {
            "title": [
                "The world's leading software development platform \u00b7 GitHub",
                "1clr-code-hosting"
            ],
            "url": "https://github.com/"
        }
    ]
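If you need something other than the default settings, Processor accepts a Scrapy Settings object, as the comment in the example notes. A small sketch of that, assuming only that you want to override the user agent (the USER_AGENT value here is illustrative):

    from scrapy.settings import Settings
    from scrapyscript import Processor

    # build a Settings object instead of passing settings=None
    settings = Settings()
    settings.set('USER_AGENT', 'Mozilla/5.0 (compatible; MyBot/1.0)')
    processor = Processor(settings=settings)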