Posts by sum*_*mit

Jsoup fetches only part of the page

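I am trying to scrape the contents of bidding websites, but I cannot get the complete page. I use crowbar on xulrunner to fetch the page first (because some elements are lazy-loaded via ajax) and then scrape from the saved file. On the home page of the bidrivals website this fails, even though the local file is well formed: jSoup seems to end with '...' characters midway through the HTML. If anyone has run into this before, please help. The following code is called for [this link].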

// Local file that will hold the page rendered by crowbar.
File f = new File(projectLocation + logFile + "bidrivalsHome");
try {
    f.createNewFile();
    log.warn("Trying to fetch mainpage through a console.");
    // Post the target URL and delay to the local crowbar server via curl.
    WinRedirect.redirect(projectLocation + "Curl.exe -s --data \"url=" + website + "&delay=" + timeDelay + "\" http://127.0.0.1:10000", projectLocation, logFile + "bidrivalsHome");
} catch (Exception e) {
    e.printStackTrace();
    log.warn("Error in fetching the nameList", e);
}
Document doc = new Document("");
try {
    // Parse the saved file as UTF-8, with the site URL as the base URI.
    doc = Jsoup.parse(f, "UTF-8", website);
} catch (IOException e1) {
    System.out.println("Error while parsing the document.");
    e1.printStackTrace();
    log.warn("Error in parsing homepage", e1);
}
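
One way to narrow this down is to check whether the file on disk is already cut short before jSoup reads it. The sketch below is not from the original post: `HtmlFileCheck` and `looksComplete` are hypothetical names, and the check (size plus presence of a closing `</html>` tag) is only a rough heuristic that uses the Java standard library alone.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the original code. */
public class HtmlFileCheck {

    /** Returns true if the file's text contains a closing html tag (case-insensitive). */
    public static boolean looksComplete(Path htmlFile) throws IOException {
        // Decoding with new String(...) replaces malformed bytes instead of throwing.
        String text = new String(Files.readAllBytes(htmlFile), StandardCharsets.UTF_8);
        return text.toLowerCase(Locale.ROOT).contains("</html>");
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java HtmlFileCheck <saved-page-file>");
            return;
        }
        Path file = Paths.get(args[0]);
        System.out.println("size in bytes: " + Files.size(file));
        System.out.println("has closing </html>: " + looksComplete(file));
    }
}
```

If the check fails, the file was probably still incomplete when it was parsed, for example because the curl call had not finished writing. If it passes, the '...' may instead come from how the parsed document is printed or inspected, such as a debugger or console that shortens long strings. Both are possibilities to test, not confirmed causes.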


java web-scraping jsoup

sum*_*mit

2016 04-12

8
Votes

1
Answer

1497
Views

