Django with Scrapy: how do I save items that have a relationship?

Question

Mur*_*aya 5 python django scrapy scrapy-spider scrapy-pipeline

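I just need to understand how to detect, from inside the spider, whether Scrapy has already saved an item. I am fetching items from a website and then fetching the comments on each item, so I first have to save the item and only afterwards save its comments. But when I write code after the yield, it gives me this error: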

save() prohibited to prevent data loss due to unsaved related object ''.

Here is my code:

def parseProductComments(self, response):

        name = response.css('h1.product-name::text').extract_first()
        price = response.css('span[id=offering-price] > span::text').extract_first()
        node = response.xpath("//script[contains(text(),'var utagData = ')]/text()")
        data = node.re('= (\{.+\})')[0]  #data = xpath.re(" = (\{.+\})")
        data = json.loads(data)

        barcode = data['product_barcode']

        objectImages = []
        for imageThumDiv in response.css('div[id=productThumbnailsCarousel]'):
            images = imageThumDiv.xpath('img/@data-src').extract()
            for image in images:
                imageQuality = image.replace('/80/', '/500/')
                objectImages.append(imageQuality)
        company = Company.objects.get(pk=3)
        comments = []
        item = ProductItem(name=name, price=price, barcode=barcode, file_urls=objectImages, product_url=response.url,product_company=company, comments = comments)
        yield item
        print item["pk"]
        for commentUl in response.css('ul.chevron-list-container'):

            url = commentUl.css('span.link-more-results::attr(href)').extract_first()
            if url is not None:
                for commentLi in commentUl.css('li.review-item'):
                    comment = commentLi.css('p::text').extract_first()
                    commentItem = CommentItem(comment=comment, product=item.instance)

                    yield commentItem
            else:

                yield scrapy.Request(response.urljoin(url), callback=self.parseCommentsPages, meta={'item': item.instance})

Here is my pipeline:

def comment_to_model(item):
    model_class = getattr(item, 'Comment')
    if not model_class:
        raise TypeError("Item is not a `DjangoItem` or is misconfigured")

def get_comment_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product=model.product, comment=model.comment)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
        obj.save()

    return (obj, created)

def get_or_create(model):
    model_class = type(model)
    created = False
    # Normally, we would use `get_or_create`. However, `get_or_create` would
    # match all properties of an object (i.e. create a new object
    # anytime it changed) rather than update an existing object.
    #
    # Instead, we do the two steps separately
    try:
        # We have no unique identifier at the moment; use the name for now.
        obj = model_class.objects.get(product_company=model.product_company, barcode=model.barcode)
    except model_class.DoesNotExist:
        created = True
        obj = model  # DjangoItem created a model for us.
        obj.save()

    return (obj, created)


def update_model(destination, source, commit=True):
    pk = destination.pk

    source_dict = model_to_dict(source)
    for (key, value) in source_dict.items():
        setattr(destination, key, value)

    setattr(destination, 'pk', pk)

    if commit:
        destination.save()
    return destination


class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']
            item_model = item.instance
            model, created = get_or_create(item_model)
            #update_model(model, item_model)

            if created:
                for image in item['files']:
                    imageItem = ProductImageItem(image=image['path'], product=item.instance)
                    imageItem.save()
                # for comment in item['comments']:
                #     commentItem = CommentItem(comment=comment, product= item.instance)
                #     commentItem.save()
            return item
        if isinstance(item, CommentItem):
            comment_to_model = item.instance
            model, created = get_comment_or_create(comment_to_model)
            if created:
                print model
            else:
                print created
            return item

Answer 1

e4c*_*4c5 2

get_or_create

A large part of your code seems to be working around an apparent weakness of get_or_create:

# Normally, we would use `get_or_create`. However, `get_or_create` would
# match all properties of an object (i.e. create a new object
# anytime it changed) rather than update an existing object.

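Fortunately, this apparent shortcoming is easy to overcome, thanks to the defaults parameter of get_or_create. From the Django documentation: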

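"Any keyword arguments passed to get_or_create() (except an optional one called defaults) will be used in a get() call. If an object is found, get_or_create() returns a tuple of that object and False. If multiple objects are found, get_or_create raises MultipleObjectsReturned. If an object is not found, get_or_create() will instantiate and save a new object, returning a tuple of the new object and True."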
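To make the split between lookup arguments and defaults concrete, here is a minimal plain-Python sketch. It is a toy stand-in that works on a list of dicts, not Django's implementation; the rows list, the LookupError and the sample values are made up for illustration.

```python
def get_or_create(rows, defaults=None, **lookup):
    """Toy stand-in for the lookup/defaults split (not Django's code)."""
    # Only the lookup arguments decide whether a row already exists.
    matches = [row for row in rows
               if all(row.get(key) == value for key, value in lookup.items())]
    if len(matches) > 1:
        raise LookupError("multiple rows matched %r" % (lookup,))
    if matches:
        return matches[0], False

    # Nothing matched: build the new row from lookup plus defaults.
    row = dict(lookup)
    row.update(defaults or {})
    rows.append(row)
    return row, True


products = []
first, created_first = get_or_create(
    products, product_company=3, barcode="123", defaults={"price": "10.00"})
# Same lookup, different defaults: the existing row is returned untouched.
second, created_second = get_or_create(
    products, product_company=3, barcode="123", defaults={"price": "12.50"})
```

The second call finds the row by company and barcode alone, so a changed price does not create a duplicate, which is the concern the comment in the question's pipeline raises.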

update_or_create

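Still not convinced that get_or_create is the right tool for the job? Neither am I. There is something even better: update_or_create! From the Django documentation: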

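"A convenience method for updating an object with the given kwargs, creating a new one if necessary. The defaults is a dictionary of (field, value) pairs used to update the object."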

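But I am not going to go into detail on how to use update_or_create, because the line in your code that attempts to update the model has been commented out and you have not clearly stated what you want to update.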
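For reference only, the difference from get_or_create can be sketched in a few lines of plain Python. This is again a toy stand-in on a list of dicts, not Django's implementation, and it skips the multiple-match case; the rows list and sample values are hypothetical.

```python
def update_or_create(rows, defaults=None, **lookup):
    """Toy stand-in: update the matching row, or create one if none matches."""
    defaults = defaults or {}
    for row in rows:
        if all(row.get(key) == value for key, value in lookup.items()):
            # Found: apply the defaults as updates to the existing row.
            row.update(defaults)
            return row, False

    # Not found: create the row from lookup plus defaults.
    row = dict(lookup, **defaults)
    rows.append(row)
    return row, True


products = []
_, created_first = update_or_create(
    products, product_company=3, barcode="123", defaults={"price": "10.00"})
row, created_second = update_or_create(
    products, product_company=3, barcode="123", defaults={"price": "12.50"})
```

Here the second call changes the stored price instead of leaving it alone, which is the behaviour to reach for if re-scraped products should overwrite earlier values.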

The new pipeline

With the standard API methods, the module containing your pipeline reduces to just the ProductItemPipeline class, which can be modified as follows:

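class ProductItemPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, ProductItem):
            item['cover_photo'] = item['files'][0]['path']

            # Look the product up by company and barcode only; `defaults`
            # is used solely when a new row has to be created. The fields
            # listed there are taken from the question's ProductItem and
            # may need adjusting to the actual model.
            model, created = ProductItem.django_model.objects.get_or_create(
                product_company=item['product_company'],
                barcode=item['barcode'],
                defaults={
                    'name': item['name'],
                    'price': item['price'],
                    'product_url': item['product_url'],
                    'cover_photo': item['cover_photo'],
                })

            if created:
                for image in item['files']:
                    # Point at the saved model, not the unsaved item.instance.
                    imageItem = ProductImageItem(image=image['path'], product=model)
                    imageItem.save()
            return item

        if isinstance(item, CommentItem):
            # Same lookup the question's get_comment_or_create performed.
            model, created = CommentItem.django_model.objects.get_or_create(
                product=item['product'], comment=item['comment'])

            if created:
                print(model)
            else:
                print(created)
            return item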

The error in the original code

I do believe this is where the error comes from:

  obj = model_class.objects.get(product=model.product, comment=model.comment)

Now that we are no longer using it, the error should go away. If you still have problems, please paste the full traceback.
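For context, Django raises this message when an object is saved while one of its foreign keys points at an instance that has not been saved yet, so its pk is still None. The sketch below imitates that check with two plain-Python classes. Product, Comment and their save() methods are hypothetical stand-ins, not Django models, and the 'product' name in the message is assumed.

```python
import itertools

_next_pk = itertools.count(1)  # stand-in for database-assigned primary keys


class Product(object):
    def __init__(self, name):
        self.pk = None  # unsaved objects have no primary key yet
        self.name = name

    def save(self):
        if self.pk is None:
            self.pk = next(_next_pk)


class Comment(object):
    def __init__(self, product, text):
        self.pk = None
        self.product = product
        self.text = text

    def save(self):
        # Imitates the guard: refuse to save while the parent is unsaved.
        if self.product.pk is None:
            raise ValueError(
                "save() prohibited to prevent data loss due to "
                "unsaved related object 'product'.")
        if self.pk is None:
            self.pk = next(_next_pk)


product = Product("example")
comment = Comment(product, "nice")
try:
    comment.save()  # parent not saved yet
    failed = False
except ValueError:
    failed = True

product.save()  # save the parent first ...
comment.save()  # ... then the child saves cleanly
```

As far as I can tell, the spider in the question hands item.instance to CommentItem right after the yield, and nothing at that point guarantees the pipeline has already saved the product, so whichever approach you take, the comment has to end up pointing at a saved product.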

Archived:	9 years ago
Views:	533
Last activity:	9 years ago