Tags: java, web-crawler, google-crawlers, crawler4j
I've been experimenting with Crawler4j and have crawled some pages successfully but failed on others. For example, I can crawl Reddit successfully with this code:
public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "//home/user/Documents/Misc/Crawler/test";
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched, and then the crawler starts following links
         * which are found in these pages.
         */
        controller.addSeed("https://www.reddit.com/r/movies");
        controller.addSeed("https://www.reddit.com/r/politics");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
together with this:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.startsWith("https://www.reddit.com/");
}
in MyCrawler.java. However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs without any output and doesn't crawl anything. In myController.java I use this, analogous to the code above:
controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");
and in MyCrawler.java:
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.startsWith("http://www.ratemyprofessors.com/");
}
So I'd like to know: why does the crawler work on some sites but hang on others?
crawler4j respects crawler politeness conventions such as robots.txt. In your case, that file is the site's robots.txt (served at http://www.ratemyprofessors.com/robots.txt). Inspecting this file shows that crawling your given seed pages is not allowed:
Disallow: /ShowRatings.jsp
Disallow: /campusRatings.jsp
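If you nevertheless want to fetch these pages (keeping in mind that this deliberately ignores the site's stated crawling policy), crawler4j's robots.txt handling can be switched off on the `RobotstxtConfig` you already create in the controller. A minimal sketch, assuming crawler4j's `RobotstxtConfig.setEnabled` setter:

```java
// Disable robots.txt checking before constructing the RobotstxtServer,
// so seeds under a Disallow rule are no longer rejected.
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
```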
The crawler4j log output supports this theory:
2015-12-15 19:47:18,791 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
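The rejection can be reproduced without the library at all: by robots.txt convention, a `Disallow` line blocks every URL path that starts with the given prefix. A minimal stdlib-only sketch of that prefix matching (the `RobotsCheck` class and `isAllowed` helper are illustrative, not crawler4j's actual parser):

```java
import java.util.List;

public class RobotsCheck {
    // A path is blocked if it starts with any Disallow prefix.
    public static boolean isAllowed(String path, List<String> disallowPrefixes) {
        for (String prefix : disallowPrefixes) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The two Disallow rules quoted above from the site's robots.txt.
        List<String> disallow = List.of("/ShowRatings.jsp", "/campusRatings.jsp");

        // Both seed paths from the question fall under a Disallow prefix:
        System.out.println(isAllowed("/campusRatings.jsp?sid=1222", disallow)); // false
        System.out.println(isAllowed("/ShowRatings.jsp?tid=136044", disallow)); // false
    }
}
```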