cif*_*key 56 python unit-testing nose scrapy
I'd like to implement some unit tests in Scrapy (a screen scraper/web crawler). Since a project is run through the "scrapy crawl" command, I can apparently run it through something like nose. Since Scrapy is built on top of Twisted, can I use its unit-testing framework, Trial? If so, how? Otherwise I'd like to get nose working.
Update:
I've talked about this on Scrapy-Users, and I guess I'm supposed to "build the Response in the test code, then call the method with the response and assert that [I] get the expected items/requests in the output". I can't seem to get this to work, though.
I can put together a unittest test class and run a test, but it ends up generating this traceback. Any insight as to why?
Sam*_*nga 64
The way I've done it is to create fake responses; this way you can test the parse function offline, while still exercising real-world conditions by using real HTML.
One drawback of this approach is that your local HTML file may not reflect the latest state of the site. If the HTML changes online, you may have a serious bug while your test cases still pass, so it may not be the best way to test.
My current workflow is: whenever there is an error, I send an email to admin with the URL. Then, for that specific error, I create an HTML file with the content that is causing the error, and I write a unit test for it, as in the sketch below.
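A hypothetical helper for that capture step might look like this; save_failing_page and its location are my own naming, not part of the original workflow:

# Hypothetical helper: dump the body of a problematic response into the
# tests/responses directory so it can back a new regression test.
import os

def save_failing_page(response, file_name):
    responses_dir = os.path.join(
        os.path.dirname(os.path.realpath(__file__)), 'responses')
    path = os.path.join(responses_dir, file_name)
    with open(path, 'wb') as f:
        f.write(response.body)  # response.body is the raw bytes of the page
    return path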
This is the code I use to create sample Scrapy HTTP responses for testing, from a local HTML file:
# scrapyproject/tests/responses/__init__.py

import os

from scrapy.http import HtmlResponse, Request


def fake_response_from_file(file_name, url=None):
    """
    Create a fake Scrapy HTTP response from an HTML file.

    @param file_name: The relative filename from the responses directory,
                      but absolute paths are also accepted.
    @param url: The URL of the response.
    returns: A Scrapy HTTP response which can be used for unit testing.
    """
    if not url:
        url = 'http://www.example.com'

    request = Request(url=url)
    if not file_name.startswith('/'):
        responses_dir = os.path.dirname(os.path.realpath(__file__))
        file_path = os.path.join(responses_dir, file_name)
    else:
        file_path = file_name

    with open(file_path, 'rb') as f:
        file_content = f.read()

    # HtmlResponse is used instead of the base Response so the result has an
    # encoding and supports response.xpath()/response.css(); the base Response
    # class has no settable encoding attribute.
    return HtmlResponse(url=url, request=request,
                        body=file_content, encoding='utf-8')
The sample HTML file lives at scrapyproject/tests/responses/osdir/sample.html.
The test case could then look as follows (it lives at scrapyproject/tests/test_osdir.py):
import unittest

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderTest(unittest.TestCase):

    def setUp(self):
        self.spider = osdir_spider.DirectorySpider()

    def _test_item_results(self, results, expected_length):
        count = 0
        for item in results:
            self.assertIsNotNone(item['content'])
            self.assertIsNotNone(item['title'])
            count += 1  # parse() yields items, so count while iterating
        self.assertEqual(count, expected_length)

    def test_parse(self):
        results = self.spider.parse(fake_response_from_file('osdir/sample.html'))
        self._test_item_results(results, 10)
That's basically how I test my parse methods, but it isn't limited to parse methods. If things get more complex, I suggest looking at Mox.
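Mox is an older Google mocking library; the standard library's unittest.mock covers the same ground today. A minimal sketch of that idea, where resolve_category is a hypothetical helper standing in for whatever external dependency (database lookup, API client) your parse path might touch:

import unittest
from unittest import mock

from scrapyproject.spiders import osdir_spider
from responses import fake_response_from_file


class OsdirSpiderMockTest(unittest.TestCase):

    def test_parse_with_mocked_helper(self):
        spider = osdir_spider.DirectorySpider()
        # create=True lets us patch the hypothetical attribute; swap in the
        # real name of whatever your spider actually calls.
        with mock.patch.object(spider, 'resolve_category',
                               return_value='games', create=True):
            results = list(
                spider.parse(fake_response_from_file('osdir/sample.html')))
        self.assertTrue(results)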
Had*_*ien 16
I use Betamax to run the test against the real site the first time and keep the HTTP responses locally, so that subsequent test runs are super fast, since:
Betamax intercepts every request you make and attempts to find a matching request that has already been intercepted and recorded.
When you need to fetch the latest version of the site, just delete what Betamax has recorded and re-run the tests.
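For example, assuming the cassettes directory configured in the snippet below and pytest as the test runner:

# wipe the recorded responses, then re-record against the live site
$ rm -rf cassettes/
$ python -m pytest tests/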
Example:
from scrapy import Spider, Request
from scrapy.http import HtmlResponse


class Example(Spider):
    name = 'example'

    url = 'http://doc.scrapy.org/en/latest/_static/selectors-sample1.html'

    def start_requests(self):
        yield Request(self.url, self.parse)

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            yield {'image_href': href}


# Test part
from betamax import Betamax
from betamax.fixtures.unittest import BetamaxTestCase

with Betamax.configure() as config:
    # where betamax will store cassettes (http responses):
    config.cassette_library_dir = 'cassettes'
    config.preserve_exact_body_bytes = True


class TestExample(BetamaxTestCase):  # superclass provides self.session

    def test_parse(self):
        example = Example()

        # http response is recorded in a betamax cassette:
        response = self.session.get(example.url)

        # forge a scrapy response to test
        scrapy_response = HtmlResponse(body=response.content, url=example.url)

        result = example.parse(scrapy_response)

        # next() works on both Python 2 and 3, unlike .next()
        self.assertEqual({'image_href': 'image1.html'}, next(result))
        self.assertEqual({'image_href': 'image2.html'}, next(result))
        self.assertEqual({'image_href': 'image3.html'}, next(result))
        self.assertEqual({'image_href': 'image4.html'}, next(result))
        self.assertEqual({'image_href': 'image5.html'}, next(result))

        with self.assertRaises(StopIteration):
            next(result)
For the record, I discovered Betamax at PyCon 2015, thanks to Ian Cordasco's talk.
This is a very late answer, but I was annoyed enough with Scrapy testing that I wrote scrapy-test, a framework for testing Scrapy crawlers against defined specifications.
It works by defining test specifications rather than static outputs. For example, if we are crawling items like this:
{
    "name": "Alex",
    "age": 21,
    "gender": "Female"
}
we can define a scrapy-test ItemSpec for it:
from scrapytest.tests import Match, MoreThan, LessThan, Type
from scrapytest.spec import ItemSpec


class MySpec(ItemSpec):
    name_test = Match('.{3,}')  # name should be at least 3 characters long
    age_test = Type(int), MoreThan(18), LessThan(99)
    gender_test = Match('Female|Male')
The same idea applies to scrapy stats, via a StatsSpec:
from scrapytest.spec import StatsSpec
from scrapytest.tests import MoreThan


class MyStatsSpec(StatsSpec):
    validate = {
        'item_scraped_count': MoreThan(0),
    }
Afterwards, it can be run against live or cached results:
$ scrapy-test
# or
$ scrapy-test --cache
I've been using cached runs while developing changes, and daily cronjobs to detect website changes; a sketch of that schedule follows.
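As a crontab entry this could look something like the following; the project path and log location are placeholders of mine:

# daily live run at 06:00 to detect website changes
# (cached runs via `scrapy-test --cache` stay manual, during development)
0 6 * * * cd /path/to/scrapyproject && scrapy-test >> /var/log/scrapy-test.log 2>&1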
I use Twisted's trial to run tests, similar to Scrapy's own tests. It already starts a reactor, so I make use of CrawlerRunner without having to worry about starting and stopping one in the tests.
Stealing some ideas from the check and parse Scrapy commands, I ended up with the following base TestCase class for running assertions against live sites:
from twisted.trial import unittest

from scrapy.crawler import CrawlerRunner
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.spider import iterate_spider_output


class SpiderTestCase(unittest.TestCase):

    def setUp(self):
        self.runner = CrawlerRunner()

    def make_test_class(self, cls, url):
        """
        Make a class that proxies to the original class,
        sets up a URL to be called, and gathers the items
        and requests returned by the parse function.
        """
        class TestSpider(cls):
            # This is a once used class, so writing into
            # the class variables is fine. The framework
            # will instantiate it, not us.
            items = []
            requests = []

            def start_requests(self):
                # NOTE: make_requests_from_url is deprecated in recent Scrapy
                # versions; Request(url, dont_filter=True) is the equivalent.
                req = super(TestSpider, self).make_requests_from_url(url)
                req.meta["_callback"] = req.callback or self.parse
                req.callback = self.collect_output
                yield req

            def collect_output(self, response):
                try:
                    cb = response.request.meta["_callback"]
                    for x in iterate_spider_output(cb(response)):
                        if isinstance(x, (BaseItem, dict)):
                            self.items.append(x)
                        elif isinstance(x, Request):
                            self.requests.append(x)
                except Exception as ex:
                    print("ERROR", "Could not execute callback: ", ex)
                    raise  # bare raise preserves the original traceback

                # Returning any requests here would make the crawler follow them.
                return None

        return TestSpider
Example:
# at module level: from twisted.internet import defer

@defer.inlineCallbacks
def test_foo(self):
    tester = self.make_test_class(FooSpider, 'https://foo.com')
    yield self.runner.crawl(tester)
    self.assertEqual(len(tester.items), 1)
    self.assertEqual(len(tester.requests), 2)
Or perform one request in setUp and run multiple tests against the results:
@defer.inlineCallbacks
def setUp(self):
    super(FooTestCase, self).setUp()
    if FooTestCase.tester is None:
        FooTestCase.tester = self.make_test_class(FooSpider, 'https://foo.com')
        yield self.runner.crawl(self.tester)

def test_foo(self):
    self.assertEqual(len(self.tester.items), 1)
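Because the base class extends twisted.trial.unittest.TestCase, these tests are meant to be run with Twisted's trial runner, which manages the reactor for you. Assuming a hypothetical test module path, the invocation is along these lines:

$ trial scrapyproject.tests.test_foo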