Web crawler specifically for downloading images and files

Question


Emm*_*ohn 2 java web-crawler html-parsing jsoup

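I am working on an assignment for one of my classes.

I am supposed to write a web crawler that downloads files and images from a website, given a specified crawl depth.

I am allowed to use a third-party parsing API, so I am using Jsoup. I have also tried htmlparser. Both are good, but neither is perfect.

I use Java's default URLConnection to check the content type before processing a URL, but it becomes very slow as the number of links grows.

Question: does anyone know of a specialized parser API for images and links?

I could start writing my code with Jsoup, but I am being lazy. Besides, why reinvent the wheel if a working solution already exists? Any help would be appreciated.

While looping over the links, I need to check the contentType to determine efficiently whether a link points to a file, but Jsoup does not have what I need. Here is what I have: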

    // Requires: org.jsoup.Jsoup, org.jsoup.Connection, org.jsoup.nodes.Element
    for (Element link : links) {
        String linkUrl = link.absUrl("href");

        // Skip unresolvable links, in-page anchors, and URLs that were already crawled.
        if (linkUrl.isEmpty() || linkUrl.contains("#")
                || DownloadRepository.curlExists(linkUrl)) {
            continue;
        }

        // Fetch the resource without failing on non-HTML content or HTTP error codes.
        Connection.Response mimeResponse = Jsoup.connect(linkUrl)
                .ignoreContentType(true)
                .ignoreHttpErrors(true)
                .execute();

        WebUrl webUrl = new WebUrl(linkUrl, currentDepth + 1);
        // contentType() returns null when the server sends no Content-Type header.
        String contentType = mimeResponse.contentType();

        if (contentType != null && contentType.contains("html")) {
            page.addToCrawledPages(new WebPage(webUrl));
        } else if (contentType != null && contentType.contains("image")) {
            page.addToImages(new WebImage(webUrl));
        } else {
            page.addToFiles(new WebFile(webUrl));
        }

        DownloadRepository.addCrawledURL(linkUrl);
    }


Update: Based on Yoshi's answer, I was able to get my code working. Here is the link:

https://github.com/unekwu/cs_nemesis/blob/master/crawler/crawler/src/cu/cs/cpsc215/project1/parser/Parser.java

Answer 1

Ish*_*shi 5

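Use jSoup. I think this API is good enough for your purpose. You can also find a good cookbook on its site.

A few steps:

Jsoup: how to get an image's absolute URL?
How to download images from any web page in Java
You can write your own recursive method that walks the links on a page that contain the required domain name or are relative links. Use it to collect all the links and find all the images on those pages. Write it yourself; that is not a bad practice.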
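
The link-walking step above can be sketched roughly as follows. This is a minimal illustration rather than the answerer's code: it uses an iterative breadth-first traversal instead of recursion (so each URL is recorded at the shallowest depth at which it is reachable), and the class name `DepthLimitedCrawler` and the `linkFetcher` parameter are hypothetical. In a real crawler, `linkFetcher` would presumably wrap something like `Jsoup.connect(url).get().select("a[href]")` and collect each element's `absUrl("href")`; here it is a plain function so the sketch runs without any network access.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class DepthLimitedCrawler {

    /**
     * Visits pages breadth-first, starting at startUrl (depth 0), and returns
     * every discovered URL mapped to the depth at which it was first seen.
     * linkFetcher is a hypothetical hook that returns the absolute URLs
     * linked from a given page.
     */
    public static Map<String, Integer> crawl(String startUrl, int maxDepth,
                                             Function<String, List<String>> linkFetcher) {
        Map<String, Integer> depthByUrl = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthByUrl.put(startUrl, 0);
        queue.add(startUrl);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthByUrl.get(url);
            if (depth >= maxDepth) {
                continue; // do not follow links beyond the depth limit
            }
            for (String link : linkFetcher.apply(url)) {
                // Skip unresolved links and anything already recorded.
                if (link.isEmpty() || depthByUrl.containsKey(link)) {
                    continue;
                }
                depthByUrl.put(link, depth + 1);
                queue.add(link);
            }
        }
        return depthByUrl;
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" standing in for real pages.
        Map<String, List<String>> fakeSite = Map.of(
                "a", List.of("b", "c"),
                "b", List.of("a", "d"),
                "d", List.of("e"));
        System.out.println(crawl("a", 2, url -> fakeSite.getOrDefault(url, List.of())));
    }
}
```

The map of visited URLs doubles as the "already crawled" check, so pages that link back to each other do not cause an endless loop.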

You don't need to use the URLConnection class; jSoup has its own wrapper for it.

For example:

You can get the DOM object with just one line of code:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();


instead of this code:

    URL oracle = new URL("http://www.oracle.com/");
    URLConnection yc = oracle.openConnection();
    BufferedReader in = new BufferedReader(new InputStreamReader(
                                yc.getInputStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null) 
        System.out.println(inputLine);
    in.close();


Update 1: Try adding the following lines to your code:

Connection.Response res = Jsoup.connect("http://en.wikipedia.org/").execute();
String pageContentType = res.contentType();
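
Once the content type string is available, it still has to be mapped to a page, an image, or some other file. The following is a minimal sketch of one way to do that, assuming a hypothetical `LinkClassifier` helper and `Kind` enum (neither is part of jSoup). It strips parameters such as `; charset=UTF-8` and treats a missing header as a file, since the `Content-Type` header may be absent and the value can then be null.

```java
import java.util.Locale;

public class LinkClassifier {

    /** Hypothetical categories mirroring the pages/images/files split in the question. */
    public enum Kind { PAGE, IMAGE, FILE }

    /**
     * Maps a raw Content-Type header value (which may be null or carry
     * parameters such as "; charset=UTF-8") to a category.
     */
    public static Kind classify(String contentType) {
        if (contentType == null) {
            return Kind.FILE; // no header: fall back to treating it as a file
        }
        // Drop parameters and normalize case before comparing.
        String mime = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        if (mime.equals("text/html") || mime.equals("application/xhtml+xml")) {
            return Kind.PAGE;
        }
        if (mime.startsWith("image/")) {
            return Kind.IMAGE;
        }
        return Kind.FILE;
    }

    public static void main(String[] args) {
        System.out.println(classify("text/html; charset=UTF-8"));
        System.out.println(classify("image/png"));
        System.out.println(classify(null));
    }
}
```

In the crawler loop, a call such as `LinkClassifier.classify(res.contentType())` could replace the chained `contains` checks.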

