Parsing robots.txt with Java and determining whether a URL is allowed

Emi*_*ebb 4 java web-scraping jsoup crawler4j

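I am currently using jsoup in an application to parse and analyze web pages. But I want to make sure that I adhere to the robots.txt rules and only visit pages that are allowed.

I am pretty sure that jsoup is not made for this; it is all about web scraping and parsing. So I plan to write a function/module that reads the robots.txt of a domain/site and determines whether the URL I am about to visit is allowed.

I did some research and found the following. But I am not sure about these, so if anyone has done a similar project that involved robots.txt parsing, it would be great if you could share your thoughts and ideas.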

http://sourceforge.net/projects/jrobotx/

https://code.google.com/p/crawler-commons/

http://code.google.com/p/crowl/source/browse/trunk/Crow/src/org/crow/base/Robotstxt.java?r=12

kor*_*rpe 6

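A late answer, in case you, or someone else, are still looking for a way to do this. I am using https://code.google.com/p/crawler-commons/ in version 0.2 and it seems to work well. Here is a simplified example of the code I use: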

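import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.entity.BufferedHttpEntity;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.BasicHttpContext;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.EntityUtils;
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;
import crawlercommons.robots.SimpleRobotRulesParser;

String USER_AGENT = "WhateverBot";
String url = "http://www.....com/";
URL urlObj = new URL(url);
// One robots.txt applies per scheme + host + port, so use that as the cache key
String hostId = urlObj.getProtocol() + "://" + urlObj.getHost()
                + (urlObj.getPort() > -1 ? ":" + urlObj.getPort() : "");
// In a real crawler, keep this map and the client alive across requests
HttpClient httpclient = new DefaultHttpClient();
Map<String, BaseRobotRules> robotsTxtRules = new HashMap<String, BaseRobotRules>();
BaseRobotRules rules = robotsTxtRules.get(hostId);
if (rules == null) {
    HttpGet httpget = new HttpGet(hostId + "/robots.txt");
    HttpContext context = new BasicHttpContext();
    HttpResponse response = httpclient.execute(httpget, context);
    if (response.getStatusLine() != null && response.getStatusLine().getStatusCode() == 404) {
        // No robots.txt on this host: everything is allowed
        rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
        // consume entity to deallocate connection
        EntityUtils.consumeQuietly(response.getEntity());
    } else {
        BufferedHttpEntity entity = new BufferedHttpEntity(response.getEntity());
        SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
        rules = robotParser.parseContent(hostId, IOUtils.toByteArray(entity.getContent()),
                "text/plain", USER_AGENT);
    }
    robotsTxtRules.put(hostId, rules);
}
boolean urlAllowed = rules.isAllowed(url);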

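Obviously this is not related to Jsoup in any way; it just checks whether a given URL is allowed to be crawled for a certain USER_AGENT. For fetching the robots.txt I use Apache HttpClient in version 4.2.1, but this could be replaced by java.net classes as well.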
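For anyone who wants to drop the HttpClient dependency, here is a rough sketch of the fetch step using only java.net. The class name `RobotsTxtFetcher`, its method names, and the 10-second timeouts are made up for illustration and are not part of any library. It returns null on a 404 so the caller can fall back to "allow all", mirroring the code above. Other error statuses surface as an IOException and are left for the caller to handle.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of crawler-commons or jsoup.
class RobotsTxtFetcher {

    /** Builds "<scheme>://<host>[:port]/robots.txt" for the site serving pageUrl. */
    static String robotsUrlFor(String pageUrl) {
        try {
            URL u = new URL(pageUrl);
            String port = u.getPort() > -1 ? ":" + u.getPort() : "";
            return u.getProtocol() + "://" + u.getHost() + port + "/robots.txt";
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + pageUrl, e);
        }
    }

    /**
     * Downloads the robots.txt that applies to pageUrl.
     * Returns null if the server answers 404 (no robots.txt present).
     * Other HTTP error statuses are reported as an IOException by getInputStream().
     */
    static byte[] fetch(String pageUrl, String userAgent) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(robotsUrlFor(pageUrl)).openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setConnectTimeout(10_000); // assumed timeout values, tune as needed
        conn.setReadTimeout(10_000);
        try {
            if (conn.getResponseCode() == HttpURLConnection.HTTP_NOT_FOUND) {
                return null;
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

The byte array returned by `fetch` can be handed to `parseContent` in place of `IOUtils.toByteArray(entity.getContent())`.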

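Please note that this code only checks for allowance or disallowance and does not consider other robots.txt features like "Crawl-delay". But as crawler-commons provides this feature as well, it can easily be added to the code above.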
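As a rough idea of what honoring a crawl delay could look like, here is a small per-host throttle using only the JDK. `HostThrottle` and `reserve` are hypothetical names invented for this sketch. It assumes the delay is given in milliseconds and treats negative values as "no delay", on the assumption that a library may report an unset delay with a negative sentinel.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: spaces out requests to the same host by a given delay.
class HostThrottle {
    // Earliest time (epoch millis) at which the next request to each host may start
    private final Map<String, Long> nextAllowedAt = new HashMap<>();

    /**
     * Reserves the next request slot for hostId and returns how many
     * milliseconds the caller should wait before sending the request.
     * A negative crawlDelayMs is treated as "no delay".
     */
    synchronized long reserve(String hostId, long crawlDelayMs, long nowMs) {
        long delay = Math.max(0L, crawlDelayMs);
        long earliest = nextAllowedAt.getOrDefault(hostId, nowMs);
        long startAt = Math.max(earliest, nowMs);
        nextAllowedAt.put(hostId, startAt + delay);
        return startAt - nowMs;
    }
}
```

In the code above, the delay would come from the parsed rules. As far as I know crawler-commons exposes it as `getCrawlDelay()` on the rules object, but check the units and the "unset" value in the version you use. The call would then be something like `Thread.sleep(throttle.reserve(hostId, rules.getCrawlDelay(), System.currentTimeMillis()))` before each fetch.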