Scrapy: Suggestions for multiple returns/items to a database

DMM*_*MML 10 pipeline multiple-tables scrapy

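To elaborate on the title of this question: I am scraping information from a movie website. I currently have a MySQL database populated with movie titles, movie urls, etc. I now want to take those urls from the database and set them as the start_urls of a new spider. Each url is a link to [insert arbitrary movie]'s webpage, which conveys much more information. The information I'm interested in is: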

  • distributor (e.g. Fox)
  • rating (e.g. PG-13)
  • director
  • genre (e.g. comedy)
  • actors
  • producer/s

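Of these, distributor, rating, director and genre will each have one 'thing' associated with them per movie webpage (one rating, one director, etc.). There will of course be multiple actors and, depending on the film, multiple producers (bigger-name films/most films). This is where I am having an issue. I want to establish a pipeline which puts each piece of info in an appropriate table within my MySQL database. So, a table for director, a table for rating, etc. Each table will also have the movie title. I can state the problem itself like this: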

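I'm having trouble reconciling how to construct an appropriate pipeline with an appropriate spider. I'm not sure whether I can return multiple things from one spider and send them to different pipelines (creating one item to handle the single attributes and a different item to handle the 'multiple' attributes), or whether I should use the same pipeline and somehow specify what goes where (I'm not sure if I can only return one thing after scraping). I will show my code, and hopefully the issue will become clearer. *Note: it is not yet complete - I'm just trying to fill in the blanks on how to do this.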

Spider:

  class ActorSpider(BaseSpider):
      import sys; sys.path.append("/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages")
      import MySQLdb
      db = MySQLdb.connect(db = 'testdb', user='testuser', passwd='test')
      dbc = db.cursor()
      name = 'ActorSpider'
      allowed_domains = ['movie website']
      #start_urls = #HAVE NOT FILLED THIS IN YET- WILL BE A SELECT STATEMENT, GATHERING ALL URLS

      def parse(self, response):

          hxs = HtmlXPathSelector(response)

          #Expect only singular items (ie. one title, one rating, etc.)

          single_info = SingleItem()
          title = hxs.select('[title tags here]').extract()
          distributor = hxs.select('[distributor tags here]').extract()
          rating = hxs.select('[rating tags here]').extract()
          director = hxs.select('[director tags here]').extract()
          genre = hxs.select('[genre tags here]').extract()

          single_items = []
          single_info['title'] = title
          single_info['distributor'] = distributor
          single_info['rating'] = rating
          single_info['director'] = director
          single_info['genre'] = genre
          single_items.append(single_info) #Note: not sure if I want to return this or the single info

          #return single_items


          #Multiple items in a field

          multi_info = MultiItem()
          actors = hxs.select('[actor tags here]').extract()
          producers = hxs.select('[producer tags here]').extract()

          actor_items= []
          for i in range(len(actors)):
              multi_info['title'] = title
              multi_info['actor'] = actors[i]
              actor_items.append(multi_info)

          #return actor_items - can I have multiple returns in my code to specify which pipeline is used, or which table this should be inserted into

          producer_items = []
          for i in range(len(producers)):
              multi_info['title'] = title
              multi_info['producer'] = producers[i]
              producer_items.append(multi_info)
          #return producer_items - same issue - are multiple returns allowed? Should I try to put both the 'single items' and 'multiple items' in on big 'items' list?  Can scrapy figure that out or how would I go about specifying?

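I've put a number of questions in the comments, which may be unclear - I'm not sure how to direct everything so that it ends up in the appropriate table. This may be clearer when you read the pipeline, which is: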

 class IndMoviePipeline(object):

     def __init__(self):
         'initiate the database connection'
         self.conn = MySQLdb.connect(user='testuser', passwd='test', db='testdb', host='localhost', charset='utf8', use_unicode=True)
         self.cursor = self.conn.cursor()

     def process_item(self, item, spider):

         try:
             if 'producer' in item:
                 self.cursor.execute("""INSERT INTO Producers (title, producer) VALUES (%s, %s)""", (item['title'], item['producer']))
             elif 'actor' in item:
                 self.cursor.execute("""INSERT INTO Actors (title, actor) VALUES (%s, %s)""", (item['title'], item['actor']))
             else:
                 self.cursor.execute("""INSERT INTO Other_Info (title, distributor, rating, director, genre) VALUES (%s, %s, %s, %s, %s)""", (item['title'], item['distributor'], item['rating'], item['director'], item['genre'])) #NOTE: I will likely change 'Other_Info' table to just populating the original table from which the URLS will be pulled
             self.conn.commit()
         except MySQLdb.Error, e:
             print "Error %d: %s" % (e.args[0], e.args[1])

         return item

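I think this would work to direct each item to the appropriate table within the database. Based on this, I think it would work to have one big list of items and append everything to it, so: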

 items = []
 items.append(single_info)

 for i in range(len(producers)):
      multi_info['title'] = title
      multi_info['producer'] = producers[i]
      items.append(multi_info)

 for i in range(len(actors)):
      multi_info['title'] = title
      multi_info['actor'] = actors[i]
      items.append(multi_info)

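And then just let the pipeline sort it all out with those if statements. I'm not sure, though, whether this is the best way to do it, and I would really appreciate suggestions.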

aud*_*ude 13

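Conceptually, scrapy Items generally refer to a single "thing" being scraped (in your case, a movie) and have Fields that represent the data that make up that "thing". So consider having: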

from scrapy.item import Item, Field

class MovieItem(Item):
  title = Field()
  director = Field()
  actors = Field()

Then when you scrape the item:

item = MovieItem()

title = hxs.select('//some/long/xpath').extract()
item['title'] = title

actors = hxs.select('//some/long/xpath').extract()
item['actors'] = actors

return item

A spider's parse method should always return or yield either scrapy.item.Item objects or scrapy.http.Request objects.
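To make the return-or-yield point concrete, here is a minimal sketch that deliberately does not import Scrapy, so it runs anywhere. `parse_movie` and `parse_many` are hypothetical stand-ins for spider callbacks, plain dicts stand in for Item objects, and the movie data is placeholder text. It only illustrates that a generator can hand back any number of records, each built fresh, with the multi-valued fields kept as lists on a single record per movie.

```python
def parse_movie(title, director, actors, producers):
    """Hypothetical stand-in for a spider callback: yields one record per movie.

    In a real spider these values would come from selectors on the response;
    they are passed in here so the sketch runs without Scrapy.
    """
    # A new dict is created on every call, so records never share state.
    yield {
        "title": title,
        "director": director,
        "actors": list(actors),        # multi-valued fields stay as lists
        "producers": list(producers),
    }


def parse_many(movies):
    """A generator may yield as many records as it likes, one after another."""
    for movie in movies:
        for item in parse_movie(**movie):
            yield item


if __name__ == "__main__":
    scraped = [
        {"title": "Movie A", "director": "Director A",
         "actors": ["Actor 1", "Actor 2"], "producers": ["Producer 1"]},
        {"title": "Movie B", "director": "Director B",
         "actors": ["Actor 3"], "producers": []},
    ]
    for item in parse_many(scraped):
        print(item["title"], len(item["actors"]))
```

Because each record is created inside the callback, appending or yielding it repeatedly never ends up pointing at the same object, which is a risk when one item instance is reused across loop iterations.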

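From there, how you process the MovieItems is up to you. You could have a pipeline for each property of the MovieItem, but that is not recommended. What I would recommend instead is a single MySQLPersistancePipeline object that has methods for persisting each of the fields of the MovieItem. So something like: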

class MySQLPersistancePipeline(object):
  ...
  def persist_producer(self, item):
    # query parameters are passed as a tuple, the portable DB-API form
    self.cursor.execute('insert into producers ...', (item['producer'],))

  def persist_actors(self, item):
    for actor in item['actors']:
      self.cursor.execute('insert into actors ...', (actor,))

  def process_item(self, item, spider):
    # the helpers are methods, so they must be called on self
    self.persist_producer(item)
    self.persist_actors(item)
    return item
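For something you can actually run, here is a rough sketch of the same single-pipeline idea. Note the assumptions: it swaps MySQLdb for the standard-library `sqlite3` module purely so it works without a MySQL server, the class name `SQLitePersistencePipeline` is made up for this example, the `Actors` and `Producers` tables (keyed by title) follow the layout in the question, and the item is a plain dict with `actors` and `producers` lists. Also note that sqlite3 uses `?` placeholders where MySQLdb uses `%s`.

```python
import sqlite3


class SQLitePersistencePipeline(object):
    """Hypothetical single pipeline that writes each field to its own table.

    sqlite3 stands in for MySQLdb so the sketch is self-contained.
    """

    def __init__(self, conn=None):
        # Use an in-memory database unless a connection is supplied.
        self.conn = conn if conn is not None else sqlite3.connect(":memory:")
        self.cursor = self.conn.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS Actors (title TEXT, actor TEXT)")
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS Producers (title TEXT, producer TEXT)")

    def persist_actors(self, item):
        # One row per actor, each carrying the movie title.
        rows = [(item["title"], actor) for actor in item.get("actors", [])]
        self.cursor.executemany(
            "INSERT INTO Actors (title, actor) VALUES (?, ?)", rows)

    def persist_producers(self, item):
        # One row per producer, each carrying the movie title.
        rows = [(item["title"], p) for p in item.get("producers", [])]
        self.cursor.executemany(
            "INSERT INTO Producers (title, producer) VALUES (?, ?)", rows)

    def process_item(self, item, spider):
        self.persist_actors(item)
        self.persist_producers(item)
        self.conn.commit()
        return item  # hand the item on to any later pipeline


if __name__ == "__main__":
    pipeline = SQLitePersistencePipeline()
    movie = {"title": "Movie A",
             "actors": ["Actor 1", "Actor 2"],
             "producers": ["Producer 1"]}
    pipeline.process_item(movie, spider=None)
    count = pipeline.cursor.execute("SELECT COUNT(*) FROM Actors").fetchone()[0]
    print(count)
```

The point of the design is that the spider only has to describe one movie; deciding which table each field lands in is entirely the pipeline's job, so no if/elif dispatch on item type is needed.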