ole*_*rio 6 python amazon-web-services scrapy aws-lambda
Currently I have two small projects that use Scrapy. One basically scrapes URLs, and the other scrapes products from those URLs. The directory structure looks like this:
.
├── requirements.txt
├── .venv
├── url
│   ├── geckodriver
│   ├── scrapy.cfg
│   └── url
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       ├── settings.py
│       └── spiders
│           ├── store1.py
│           ├── store2.py
│           └── ...
└── product
    ├── geckodriver
    ├── scrapy.cfg
    └── product
        ├── items.py
        ├── middlewares.py
        └── ...
When I want to run a spider, I always have to change into the corresponding directory first: ~/search/url$ scrapy crawl store1 or ~/search/product$ scrapy crawl store1.
How can I deploy this project and run it as an AWS Lambda function?
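For reference, the same crawl can also be started from Python rather than the shell. A minimal sketch, assuming it is executed from the directory that contains scrapy.cfg (the spider name "store1" is taken from the commands above):

# Run a project spider in-process instead of shelling out to `scrapy crawl`.
# Must be executed from ~/search/url or ~/search/product, where scrapy.cfg
# lives, so get_project_settings() can locate settings.py.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("store1")   # same spider name as in `scrapy crawl store1`
process.start()           # blocks until the crawl finishes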
小智 0
This code is part of a script used in an earlier project for a client.
Just replace spider_class_getting_from_spiders with your own spider class.
import sys
import types

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from settings import *   # project settings, e.g. DOWNLOADER_MIDDLEWARES
from spiders import *    # makes the spider classes importable

setup()  # start crochet's reactor thread so Twisted can run inside Lambda

# Stub out sqlite modules that the Lambda runtime does not provide
# (imp.new_module is deprecated; types.ModuleType is the modern equivalent).
sys.modules["sqlite"] = types.ModuleType("sqlite")
sys.modules["sqlite3.dbapi2"] = types.ModuleType("sqlite.dbapi2")


@wait_for(900)  # maximum 15 minutes, matching Lambda's timeout ceiling
def crawl(spider_class_getting_from_spiders):
    """
    wait_for(timeout=<seconds>): change the timeout as needed. This function
    raises crochet.TimeoutError if more than 900 seconds elapse without an
    answer being received.
    """
    configure_logging({'LOG_LEVEL': 'ERROR'})
    runner = CrawlerRunner({'DOWNLOADER_MIDDLEWARES': DOWNLOADER_MIDDLEWARES})
    return runner.crawl(spider_class_getting_from_spiders)


def lambda_handler(event, context):
    crawl(spider_class_getting_from_spiders)  # replace with your spider class
    # Return the whole event instead of just a response, so the input can be
    # passed on to the next state in a Step Functions workflow.
    event['statusCode'] = 200
    return event