How to get the pipeline object in a Scrapy Spider

Question

Pit*_*tty 5 python mongodb scrapy

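I am using MongoDB to store the crawled data.

Now I want to query the date of the most recent stored data, so that I can continue crawling from there instead of restarting from the beginning of the URL list. (The URLs are determined by date, for example: /2014-03-22.html)

I want only a single connection object performing the database operations, and that object lives in the pipeline.

So I would like to know how to get the pipeline object inside the Spider (the existing one, not a new one).

Or, any better solution for incremental updates...

Thanks in advance.

Sorry for my poor English. Here is what I have tried so far: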

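# This is my pipeline
import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # pymongo.Connection was removed in PyMongo 3.0; MongoClient replaces it.
        # `settings` is the project's settings object, imported elsewhere.
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ...
    def process_item(self, item, spider):
        ...
    def get_date(self):
        ...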

And the spider:

import scrapy

class TestSpider(scrapy.Spider):
    name = "test"
    ...

    def parse(self, response):
        # I want to get the pipeline object here.
        mongo = MongoDBPipeline()  # done this way, a new pipeline object is created
        mongo.get_date()           # Scrapy must already hold a pipeline object for this spider;
                                   # I want that object, the one created when Scrapy started.

OK, I just don't want to create a new object... I admit I am a bit obsessive about this.

Answer 1

小智 4

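A Scrapy pipeline has an open_spider method, which is executed after the spider is initialized. From there you can pass your spider a reference to the database connection, to the get_date() method, or to the pipeline itself. An example of the last option, using your code: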

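# This is my pipeline
import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        # pymongo.Connection was removed in PyMongo 3.0; MongoClient replaces it.
        # `settings` is the project's settings object, imported elsewhere.
        self.connection = pymongo.MongoClient(settings['MONGODB_SERVER'], settings['MONGODB_PORT'])
        ...

    def process_item(self, item, spider):
        ...
    def get_date(self):
        ...

    def open_spider(self, spider):
        # Hand the already-created pipeline instance to the spider.
        spider.myPipeline = self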

Then, in the spider:

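import scrapy

class TestSpider(scrapy.Spider):
    name = "test"

    def __init__(self, *args, **kwargs):
        # Let the base Spider class do its own setup first.
        super(TestSpider, self).__init__(*args, **kwargs)
        self.myPipeline = None  # replaced by the pipeline in open_spider

    def parse(self, response):
        self.myPipeline.get_date()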

I don't think the __init__() method is actually necessary here, but I included it to show that open_spider replaces the attribute after initialization.
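To make the hand-off easier to follow without a running crawler or database, here is a minimal sketch that imitates it in plain Python. It is not Scrapy itself: `FakeCollection`, `ResumeSpider`, `pending_urls` and `run_lifecycle` are hypothetical names made up for illustration, the stored dates are sample values, and `run_lifecycle` only mimics the order described above (spider created first, then `open_spider` called). The way `get_date()` picks the latest date, and the way the spider resumes on the following day, are assumptions about what the asker's elided code might do.

```python
from datetime import date, timedelta


class FakeCollection:
    """Hypothetical stand-in for a MongoDB collection (not pymongo)."""

    def __init__(self, stored_dates):
        self.stored_dates = list(stored_dates)


class MongoDBPipeline:
    def __init__(self, collection):
        # The single shared "connection" lives in the pipeline.
        self.collection = collection

    def open_spider(self, spider):
        # Hand the existing pipeline instance to the spider.
        spider.myPipeline = self

    def get_date(self):
        # Assumed behavior: latest stored date, or None if nothing is stored.
        return max(self.collection.stored_dates, default=None)


class ResumeSpider:
    """Plain class standing in for a Scrapy spider (hypothetical)."""

    name = "test"

    def __init__(self):
        self.myPipeline = None  # replaced in open_spider

    def pending_urls(self, today):
        # Resume on the day after the last stored date; with an empty
        # store, fall back to crawling today only (an arbitrary choice).
        last = self.myPipeline.get_date()
        current = last + timedelta(days=1) if last else today
        urls = []
        while current <= today:
            urls.append(f"/{current.isoformat()}.html")
            current += timedelta(days=1)
        return urls


def run_lifecycle(today):
    # Mimic the order from the answer: create the spider, then open it.
    pipeline = MongoDBPipeline(FakeCollection([date(2014, 3, 20), date(2014, 3, 22)]))
    spider = ResumeSpider()
    pipeline.open_spider(spider)
    return spider, pipeline, spider.pending_urls(today)
```

Calling `run_lifecycle(date(2014, 3, 25))` returns the spider, the pipeline it was handed, and the URLs still to be crawled. The spider never constructs a pipeline of its own; it only uses the instance it received in `open_spider`, which is what the question asked for.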
