Scrapy - How do I run a spider with an AWS Lambda function?


Currently I have two small projects that use Scrapy. One project basically scrapes URLs, while the other scrapes products from those URLs. The directory structure looks like this:

.
├── requirements.txt
├── .venv
├── url
│   ├── geckodriver
│   ├── scrapy.cfg
│   └── url
│       ├── items.py
│       ├── middlewares.py
│       ├── pipelines.py
│       ├── settings.py
│       └── spiders
│           ├── store1.py
│           ├── store2.py
│           └── ...
└── product
    ├── geckodriver
    ├── scrapy.cfg
    └── product
        ├── items.py
        ├── middlewares.py
        └── ...

Whenever I want to run a spider from the command line, I first have to change into the corresponding project directory: ~/search/url$ scrapy crawl store1 or ~/search/product$ scrapy crawl store1.

How can I deploy this project and run the spiders with an AWS Lambda function?


This code is part of a script used in an earlier project for a client.

Just replace spider_class_getting_from_spiders with your spider class.

import imp
import sys

from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

from settings import *
from spiders import *


# Start crochet's background Twisted reactor so Scrapy can be driven
# from a plain synchronous Lambda handler.
setup()

# The Lambda Python runtime may lack the _sqlite3 extension; stub out the
# modules so imports that expect sqlite do not fail at start-up.
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")


@wait_for(900)  # maximum 15 minutes, the Lambda execution limit
def crawl(spider_class_getting_from_spiders):
    '''
    wait_for(timeout) is given in seconds; change it to match your
    function's timeout. This call raises crochet.TimeoutError if more
    than 900 seconds elapse without the crawl finishing.
    '''
    configure_logging({'LOG_LEVEL': 'ERROR'})
    # DOWNLOADER_MIDDLEWARES comes from settings.py; wrap it in a settings
    # dict so CrawlerRunner picks it up under the right setting name.
    runner = CrawlerRunner({'DOWNLOADER_MIDDLEWARES': DOWNLOADER_MIDDLEWARES})
    d = runner.crawl(spider_class_getting_from_spiders)
    return d


def lambda_handler(event, context):
    crawl(spider_class_getting_from_spiders)
    # Return the whole event rather than just a response so the input can
    # be passed on to the next state in a Step Functions workflow.
    event['statusCode'] = 200
    return event

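The crochet/CrawlerRunner pairing is what makes this workable inside Lambda: the runtime may reuse the same Python process across invocations, and Twisted's reactor cannot be restarted once it has stopped, so crochet keeps the reactor alive in a background thread while wait_for turns the asynchronous crawl into a blocking call. If you want to see the placeholder filled in and test the handler before packaging it, a minimal sketch could look like the following; Store1Spider is a hypothetical name for whatever class your store1.py defines, and crawl is the helper from the script above.

# Hypothetical wiring: Store1Spider stands in for your real spider class
from spiders.store1 import Store1Spider

def lambda_handler(event, context):
    crawl(Store1Spider)          # run the concrete spider class
    event['statusCode'] = 200
    return event

if __name__ == '__main__':
    # simulate a Lambda invocation locally before zipping the project
    print(lambda_handler({'source': 'local-test'}, None))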