Emi*_*ebb 4 java web-scraping jsoup crawler4j
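I currently use jsoup in an application to parse and analyze web pages. However, I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed.
I am fairly sure jsoup is not made for this; it is all about web scraping and parsing. So I plan to write a function/module that reads the robots.txt of the domain/site and determines whether the URL I am about to visit is allowed.
I did some research and found the following. I am not sure about any of them, so if anyone has worked on a similar project that involved robots.txt parsing, it would be great if you could share your thoughts.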
http://sourceforge.net/projects/jrobotx/
https://code.google.com/p/crawler-commons/
http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12
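A late answer, in case you, or anyone else, are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example of the code I use: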
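    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.commons.io.IOUtils;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.entity.BufferedHttpEntity;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.protocol.BasicHttpContext;
    import org.apache.http.protocol.HttpContext;
    import org.apache.http.util.EntityUtils;
    import crawlercommons.robots.BaseRobotRules;
    import crawlercommons.robots.SimpleRobotRules;
    import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
    import crawlercommons.robots.SimpleRobotRulesParser;

    String USER_AGENT = "WhateverBot";
    String url = "http://www.....com/";
    URL urlObj = new URL(url);
    // Cache key: scheme://host[:port], because robots.txt applies per host
    String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
            + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
    // In a real crawler, keep this cache and the client as long-lived fields
    Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
    HttpClient httpclient = new DefaultHttpClient(); // assumed setup; any HttpClient 4.2.x instance works
    BaseRobotRules rules = robotsTxtRules.get(hostId);
    if (rules == null) {
        HttpGet httpget = new HttpGet(hostId + "/robots.txt");
        HttpContext context = new BasicHttpContext();
        HttpResponse response = httpclient.execute(httpget, context);
        if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
            // No robots.txt on this host: everything is allowed
            rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
            // consume entity to deallocate connection
            EntityUtils.consumeQuietly(response.getEntity());
        } else {
            BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
            SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
            rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                    "text/plain", USER_AGENT);
        }
        robotsTxtRules.put(hostId, rules);
    }
    boolean urlAllowed = rules.isAllowed(url);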
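Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. To fetch the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net classes as well.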
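For anyone who wants to avoid the HttpClient dependency, here is a minimal sketch of the fetch step using only java.net. The class name `RobotsFetcher`, both method names, and the 10-second timeouts are my own placeholders, not part of crawler-commons or the code above. Unlike the HttpClient version, `HttpURLConnection` throws an `IOException` for error responses other than 404, so you would have to decide yourself how to treat those.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;

public class RobotsFetcher {

    /** Builds "scheme://host[:port]/robots.txt" for the host that serves pageUrl. */
    static String robotsUrl(String pageUrl) {
        URI u = URI.create(pageUrl);
        String port = u.getPort() > -1 ? ":" + u.getPort() : "";
        return u.getScheme() + "://" + u.getHost() + port + "/robots.txt";
    }

    /**
     * Downloads robots.txt for the host of pageUrl.
     * Returns null on 404 (no robots.txt, so the caller can treat it as allow-all).
     * Other error statuses surface as an IOException from getInputStream().
     */
    static byte[] fetchRobotsTxt(String pageUrl, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(robotsUrl(pageUrl)).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(10_000); // assumed timeout values
        conn.setReadTimeout(10_000);
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_FOUND) {
                return null;
            }
            try (InputStream in = conn.getInputStream();
                 ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The returned bytes could then be handed to `robotParser.parseContent(hostId, bytes, "text/plain", USER_AGENT)` in place of the `IOUtils.toByteArray(...)` call, with a `null` result mapped to `new SimpleRobotRules(RobotRulesMode.ALLOW_ALL)`.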
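Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features such as "Crawl-delay". But since crawler-commons provides this feature as well, it can easily be added to the code above.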
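As a rough idea of how the crawl delay could be honored, here is a small per-host throttle that uses only the JDK. The class `CrawlDelayThrottle` and its `reserve` method are hypothetical names of mine. I am assuming the delay comes from `rules.getCrawlDelay()` and is expressed in milliseconds, with a negative value meaning "not set"; please verify both against the crawler-commons version you use.

```java
import java.util.HashMap;
import java.util.Map;

public class CrawlDelayThrottle {
    // Earliest time (epoch millis) at which each host may be requested again.
    private final Map<String, Long> nextAllowed = new HashMap<String, Long>();

    /**
     * Returns how many milliseconds to wait before requesting hostId at time nowMillis,
     * and reserves that slot so the following request to the same host
     * is pushed back by crawlDelayMillis.
     */
    public synchronized long reserve(String hostId, long crawlDelayMillis, long nowMillis) {
        // Assumption: a negative delay means robots.txt did not specify one.
        long delay = Math.max(0L, crawlDelayMillis);
        Long next = nextAllowed.get(hostId);
        long start = (next == null) ? nowMillis : Math.max(nowMillis, next);
        nextAllowed.put(hostId, start + delay);
        return start - nowMillis;
    }
}
```

In the code above it might be used right after the `isAllowed` check, for example `long wait = throttle.reserve(hostId, rules.getCrawlDelay(), System.currentTimeMillis());` followed by `Thread.sleep(wait)` when `wait > 0`.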