What's the best approach to write contracts for Scrapy spiders that have more than one method to parse the response? I saw this answer but it didn't sound very clear to me.
My current example: I have a method called parse_product that extracts the information on a page, but some of the data I need for the same product lives on another page, so at the end of this method I yield a new request and let the new callback extract those fields and return the item.
The problem is that if I write a contract for the second method, it fails because the response doesn't have the meta attribute (containing the item with most of the fields). And if I write a contract for the first method, I can't check that it returns the fields, because it returns a new request instead of the item.
import scrapy
from scrapy.loader import ItemLoader

from myproject.items import ProductItem  # adjust to wherever ProductItem is defined


class ProductSpider(scrapy.Spider):

    def parse_product(self, response):
        il = ItemLoader(item=ProductItem(), response=response)
        # populate the item in here
        # yield a new request, passing the ItemLoader to the next callback via meta
        yield scrapy.Request(new_url, callback=self.parse_images, meta={'item': il})

    def parse_images(self, response):
        """
        @url http://foo.bar
        @returns items 1 1
        @scrapes field1 field2 field3
        """
        il = response.request.meta['item']
        # extract the new fields and add them to the item in here
        yield il.load_item()
In the example, I put the contract on the second method, but it gives me a KeyError exception on response.request.meta['item']; also, the fields field1 and field2 are populated in the first method, so the contract can't check them there.
I hope that's clear enough.
Frankly, I don't use Scrapy contracts, and I don't really recommend anyone use them. They have many problems and may one day be removed from Scrapy.
In practice, I haven't had much luck with unit tests for spiders either.
To test a spider during development, I enable the HTTP cache and then re-run the spider as many times as needed to get the scraping right.
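As a rough sketch, the built-in HTTP cache is turned on from the project's settings.py; these are standard Scrapy settings, and the values shown are just one reasonable choice:

# settings.py -- cache responses locally so repeated runs during
# development replay pages instead of re-downloading them
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0   # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'     # stored under the project's .scrapy directory
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'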
For regression bugs, I have better luck with item pipelines (or spider middlewares) that do validation at runtime; there is only so much you can catch in up-front tests anyway. It is also a good idea to have some recovery strategies in place.
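A minimal sketch of such a validation pipeline (the class name and the required-field list are made up for illustration):

from scrapy.exceptions import DropItem

class ValidateProductPipeline:
    """Drop items that are missing fields every product is expected to have."""

    REQUIRED_FIELDS = ('field1', 'field2', 'field3')  # hypothetical field names

    def process_item(self, item, spider):
        missing = [f for f in self.REQUIRED_FIELDS if not item.get(f)]
        if missing:
            raise DropItem("missing fields %s in %r" % (missing, item))
        return item

It would then be enabled through the ITEM_PIPELINES setting, e.g. {'myproject.pipelines.ValidateProductPipeline': 300} (the module path here is hypothetical).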
To keep the code base healthy, I continuously move library-like code out of the spider itself to make it easier to test.
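One way to do that, sketched here with made-up names (the selectors, module names and file paths are assumptions), is to keep the extraction logic in plain functions that take a response and return data, so they can be exercised against saved HTML fixtures in ordinary unit tests:

# extractors.py -- hypothetical module holding parsing helpers outside the spider
def extract_product_fields(response):
    """Return a dict of product fields from a product page response."""
    return {
        'field1': response.css('h1.product-title::text').get(),
        'field2': response.css('span.price::text').get(),
    }

# test_extractors.py -- a plain unit test that feeds a saved page to the helper
from scrapy.http import HtmlResponse
from extractors import extract_product_fields

def test_extract_product_fields():
    body = open('tests/fixtures/product.html', 'rb').read()
    response = HtmlResponse(url='http://foo.bar/product', body=body, encoding='utf-8')
    fields = extract_product_fields(response)
    assert fields['field1'] is not None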
Sorry if this isn't the answer you were looking for.