How do I pass arguments when using CrawlerRunner in Flask?

Cod*_*bit 3 python web-crawler scrapy flask

I have read the official Scrapy 1.0.4 documentation on how to run multiple spiders programmatically. It shows a way to do this with CrawlerRunner, so I am using that in my Flask app. But there is a problem: I want to pass an argument to the crawler to use as part of its start URLs, and I don't know how to do that. Here is my Flask app code:

from flask import Flask, redirect, url_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

app = Flask(__name__)

@app.route('/search_process', methods=['GET'])
def search():
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(EPGDspider)
    # runner.crawl(GDSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()
    return redirect(url_for('details'))

Here is my spider code:

__author__ = 'Rabbit'
import scrapy
from scrapy.selector import Selector
from scrapy import Request
from scrapy import Item, Field

class EPGD(Item):

    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()

class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        # Each result row in the table has class "odd" or "even"
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        # Collect the pagination links from the quick-page navigation block
        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        # The link right after the "#" placeholder is the next page, if any
        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i + 1 < len(url_list[0]):
                    print url_list[0][i + 1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i + 1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

As you can see, term is currently hard-coded in the spider. I just want to pass the term argument from the Flask app to my spider and build the start URL dynamically. It is much like the situation in this question: How to pass a user defined argument in scrapy spider. But here everything is done programmatically in the Flask app rather than from the command line, and I don't know how to do it. Can anyone tell me how to handle this?
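As a side note, if the search term can contain spaces or other special characters, it should be URL-encoded rather than concatenated directly into the query string. A minimal sketch (the parameter names textquery and submit come from the EPGD URL above; urlencode is in urllib.parse on Python 3, plain urllib on Python 2):

```python
from urllib.parse import urlencode

def build_search_url(term):
    """Build the EPGD text-search URL for a given term, URL-encoding it."""
    base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp"
    params = {"textquery": term, "submit": "Feeling Lucky"}
    return base + "?" + urlencode(params)

print(build_search_url("man"))
# http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky
```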

Cod*_*bit 5

I have solved this problem with crawl(crawler_or_spidercls, *args, **kwargs): you can pass arguments to the spider through this method. Here is my Flask app code:

def search():
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(EPGDspider, term="man")
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()

And here is my spider code (you can override the __init__ method and build your dynamic start_urls):

    def __init__(self, term=None, *args, **kwargs):
        super(EPGDspider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=%s&submit=Feeling+Lucky' % term]
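To see why this works even without Scrapy installed, here is a minimal stand-in that mimics the mechanism: crawl() forwards its extra keyword arguments to the spider's constructor, which uses them to build start_urls. FakeSpider and fake_crawl are hypothetical names for illustration, not real Scrapy APIs:

```python
class FakeSpider(object):
    """Stand-in for scrapy.Spider: stores extra kwargs as attributes."""
    def __init__(self, *args, **kwargs):
        self.__dict__.update(kwargs)

class EPGDspider(FakeSpider):
    name = "EPGD"

    def __init__(self, term=None, *args, **kwargs):
        super(EPGDspider, self).__init__(*args, **kwargs)
        # The term passed from the Flask app ends up here
        self.start_urls = [
            "http://epgd.biosino.org/EPGD/search/textsearch.jsp"
            "?textquery=%s&submit=Feeling+Lucky" % term
        ]

def fake_crawl(spidercls, *args, **kwargs):
    """Mimics CrawlerRunner.crawl forwarding *args/**kwargs to the spider."""
    return spidercls(*args, **kwargs)

spider = fake_crawl(EPGDspider, term="man")
print(spider.start_urls[0])
# http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky
```

In real Scrapy, runner.crawl(EPGDspider, term="man") follows the same path, so each request from Flask can build its own start URL.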