Raf*_*cho 5 python scrapy scrapyd
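I would like to know how to ignore items that don't have all of their fields filled in (some kind of drop), because in the scrapyd output I'm getting pages where not every field is populated.
I have this code: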
class Product(scrapy.Item):
    source_url = scrapy.Field(
        output_processor=TakeFirst()
    )
    name = scrapy.Field(
        input_processor=MapCompose(remove_entities),
        output_processor=TakeFirst()
    )
    initial_price = scrapy.Field(
        input_processor=MapCompose(remove_entities, clear_price),
        output_processor=TakeFirst()
    )
    main_image_url = scrapy.Field(
        output_processor=TakeFirst()
    )
The parser:
def parse_page(self, response):
    try:
        l = ItemLoader(item=Product(), response=response)
        l.add_value('source_url', response.url)
        l.add_css('name', 'h1.title-product::text')
        l.add_css('main_image_url', 'div.pics a img.zoom::attr(src)')
        # Both selectors feed initial_price; TakeFirst() keeps the first non-empty value
        l.add_css('initial_price', 'ul.precos li.preco_normal::text')
        l.add_css('initial_price', 'ul.promocao li.preco_promocao::text')
        return l.load_item()
    except Exception as e:
        # self.log() returns None, so log the URL with the message instead of printing it
        self.log("#1 ERRO: %s (%s)" % (e, response.url))
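I would like to do this with the Loader, without having to build the item with my own Selector (to avoid processing items twice). I suppose I could drop them in a pipeline, but that is probably not the best approach, since these items are not valid.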
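Validating data is one of the typical use cases for pipelines. In your case, you only need to write a small amount of code that checks for the required fields, something like: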
from scrapy.exceptions import DropItem

class YourPersonalPipeline(object):
    def process_item(self, item, spider):
        required_fields = []  # your list of required fields
        if all(field in item for field in required_fields):
            return item
        else:
            raise DropItem("your reason")
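As a rough illustration, the same check could be filled in for the `Product` item from the question. This is only a sketch: the class name `RequiredFieldsPipeline` is made up, the list of required fields is an assumption taken from the question's item definition, and treating empty strings as missing is a design choice rather than something Scrapy requires. The `ImportError` fallback is there only so the snippet can be tried without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch can be run without Scrapy installed.
    class DropItem(Exception):
        pass


class RequiredFieldsPipeline(object):  # hypothetical name
    # Assumed list: the four fields declared on Product in the question.
    required_fields = ("source_url", "name", "initial_price", "main_image_url")

    def process_item(self, item, spider):
        # A field counts as missing if it was never populated or is empty.
        missing = [
            field for field in self.required_fields
            if field not in item or item[field] in (None, "")
        ]
        if missing:
            raise DropItem(
                "Missing %s in %s" % (", ".join(missing), item.get("source_url"))
            )
        return item
```

Because the loader leaves a field unset when its selectors match nothing, the `field not in item` test is what catches pages where, for example, no price was found.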
You need to enable the pipeline in settings.py through the ITEM_PIPELINES setting. Read more in the Scrapy docs.
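Enabling it might look like the following. The module path `myproject.pipelines` is an assumption about where the class lives in your project, and `300` is an arbitrary order value (pipelines with lower numbers run first, and values are conventionally kept in the 0-1000 range).

```python
# settings.py
ITEM_PIPELINES = {
    # "myproject.pipelines" is a placeholder; use your project's module path.
    "myproject.pipelines.YourPersonalPipeline": 300,
}
```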