6 python robots.txt web-crawler python-2.7
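I am writing a crawler, and for it I am implementing a robots.txt parser using the standard library module robotparser.
It seems that robotparser is not parsing correctly; I am debugging my crawler against Google's robots.txt.
(The following examples are from IPython.)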
In [1]: import robotparser
In [2]: x = robotparser.RobotFileParser()
In [3]: x.set_url("http://www.google.com/robots.txt")
In [4]: x.read()
In [5]: x.can_fetch("My_Crawler", "/catalogs") # This should return False, since it's on Disallow
Out[5]: False
In [6]: x.can_fetch("My_Crawler", "/catalogs/p?") # This should return True, since it's Allowed
Out[6]: False
In [7]: x.can_fetch("My_Crawler", "http://www.google.com/catalogs/p?")
Out[7]: False
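This is odd, because sometimes it seems to "work" and sometimes it seems to fail. I also tried the robots.txt files from Facebook and Stackoverflow. Is this a bug in the robotparser
module, or am I doing something wrong here? If so, what?
I wonder whether this bug is related in any way.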
小智 3
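After some Googling I found nothing about robotparser problems. I ended up with something else: I found a module called reppy, ran a few tests, and it looks very robust. You can install it via pip: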
pip install reppy
Here are some examples using reppy (in IPython), again with Google's robots.txt:
In [1]: import reppy
In [2]: x = reppy.fetch("http://google.com/robots.txt")
In [3]: x.atts
Out[3]:
{'agents': {'*': <reppy.agent at 0x1fd9610>},
'sitemaps': ['http://www.gstatic.com/culturalinstitute/sitemaps/www_google_com_culturalinstitute/sitemap-index.xml',
'http://www.google.com/hostednews/sitemap_index.xml',
'http://www.google.com/sitemaps_webmasters.xml',
'http://www.google.com/ventures/sitemap_ventures.xml',
'http://www.gstatic.com/dictionary/static/sitemaps/sitemap_index.xml',
'http://www.gstatic.com/earth/gallery/sitemaps/sitemap.xml',
'http://www.gstatic.com/s2/sitemaps/profiles-sitemap.xml',
'http://www.gstatic.com/trends/websites/sitemaps/sitemapindex.xml']}
In [4]: x.allowed("/catalogs/about", "My_crawler") # Should return True, since it's allowed.
Out[4]: True
In [5]: x.allowed("/catalogs", "My_crawler") # Should return False, since it's not allowed.
Out[5]: False
In [7]: x.allowed("/catalogs/p?", "My_crawler") # Should return True, since it's allowed.
Out[7]: True
In [8]: x.refresh() # Refresh robots.txt, perhaps a magic change?
In [9]: x.ttl
Out[9]: 3721.3556718826294
In [10]: # It also has a x.disallowed function. The contrary of x.allowed
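One possible explanation for the difference between the two results, offered as an assumption rather than a confirmed diagnosis: a parser that returns the verdict of the first matching rule in file order gives a different answer from one that prefers the most specific (longest) matching rule. The sketch below is not the code of robotparser or reppy. The helper names `first_match` and `longest_match`, the `RULES` list, and the rule order in it are all hypothetical and chosen only to mirror the paths used in the question.

```python
# Hypothetical rule list: (is_allowed, path_prefix), in an ASSUMED file order.
RULES = [
    (False, "/catalogs"),        # Disallow: /catalogs
    (True, "/catalogs/about"),   # Allow: /catalogs/about
    (True, "/catalogs/p?"),      # Allow: /catalogs/p?
]


def first_match(rules, path):
    """Return the verdict of the first rule whose prefix matches path."""
    for allowed, prefix in rules:
        if path.startswith(prefix):
            return allowed
    return True  # no rule applies, so the path is allowed by default


def longest_match(rules, path):
    """Return the verdict of the longest matching prefix (Allow wins ties)."""
    matches = [(len(prefix), allowed)
               for allowed, prefix in rules
               if path.startswith(prefix)]
    if not matches:
        return True  # no rule applies, so the path is allowed by default
    # max() compares length first; on equal length True sorts above False.
    return max(matches)[1]


if __name__ == "__main__":
    for path in ("/catalogs", "/catalogs/about", "/catalogs/p?"):
        print(path, first_match(RULES, path), longest_match(RULES, path))
```

Under these assumptions, `first_match` rejects `/catalogs/p?` because `Disallow: /catalogs` is listed first, while `longest_match` accepts it, which matches the two outcomes shown in the question and in this answer.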