Jsoup 404 error

maw*_*wus 6 html java connection http-status-code-404 jsoup

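I am new to Jsoup, and I don't understand why I get a 404 error when I try to fetch a page, even though the page is accessible from a browser. I am not using any proxies. I tried the following code: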

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}

I get this exception message:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)

Alk*_*ris 22

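It seems the site does not allow bots: it returns a 404 error response if it cannot find a User-Agent header. The following works because it sets the User-Agent header: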

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com")              
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
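To see the effect in isolation, here is a minimal sketch that needs neither jsoup nor network access. It assumes a server that answers 404 unless the User-Agent starts with "Mozilla/"; that rule is only a guess at how such bot filtering might work, not the real site's logic. The names UserAgentDemo and statusFor are made up for illustration, and the JDK's built-in HttpServer and HttpURLConnection stand in for the site and for jsoup.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;

class UserAgentDemo {

    /** Starts a throwaway local server, sends one GET to it, and returns the status code. */
    static int statusFor(String userAgent) {
        try {
            // Port 0 lets the OS pick a free port on the loopback interface.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                // Assumed filtering rule: only browser-like agents get the page.
                String ua = exchange.getRequestHeaders().getFirst("User-Agent");
                boolean looksLikeBrowser = ua != null && ua.startsWith("Mozilla/");
                exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 404, -1); // -1: no body
                exchange.close();
            });
            server.start();
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
                HttpURLConnection conn =
                        (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
                if (userAgent != null) {
                    conn.setRequestProperty("User-Agent", userAgent);
                }
                // When no agent is set, the JDK sends its own "Java/<version>" value.
                return conn.getResponseCode();
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(statusFor(null)); // default agent is rejected by the assumed rule
        System.out.println(statusFor(
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0"));
    }
}
```

With the assumed rule, the first call is answered with 404 and the second with 200, which mirrors the difference between the question's code and the code above.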

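User agent

The Hypertext Transfer Protocol (HTTP) uses the "User-Agent" header to identify the client software originating the request, even when the client is not operated by a user.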


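Referrer (I don't think this one is necessary)

The HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the webpage (i.e. the URI or IRI) that linked to the resource being requested.

Just for completeness, I suggest you also set a timeout for your request. The default is 3 seconds (more recent jsoup releases document a 30-second default); if the server takes longer than that, you will get an exception (a SocketTimeoutException). Below is your code with the timeout setter added. Set it to zero for an infinite timeout.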

private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
               .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
               .referrer("http://www.google.com")
               .timeout(5 * 1000) // milliseconds, so 5 seconds; 0 means no timeout
               .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
} 
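If you want to watch a timeout fire without depending on a slow remote site, the following is a rough sketch using only the JDK. It is not jsoup code: HttpURLConnection's setConnectTimeout and setReadTimeout play the role of jsoup's timeout(...), and the local server that sleeps before answering is an assumption made purely for the demo. TimeoutDemo and timesOut are hypothetical names.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.SocketTimeoutException;
import java.net.URI;

class TimeoutDemo {

    /** Returns true if a GET against a server that waits serverDelayMs times out. */
    static boolean timesOut(int serverDelayMs, int timeoutMs) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                try {
                    Thread.sleep(serverDelayMs); // simulate a slow server
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                exchange.sendResponseHeaders(200, -1); // -1: no body
                exchange.close();
            });
            server.start();
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
                HttpURLConnection conn =
                        (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
                conn.setConnectTimeout(timeoutMs); // 0 means wait indefinitely
                conn.setReadTimeout(timeoutMs);    // 0 means wait indefinitely
                conn.getResponseCode();            // blocks until headers arrive or time runs out
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(timesOut(1000, 200)); // server slower than the limit
        System.out.println(timesOut(300, 0));    // zero disables the limit
    }
}
```

The second call shows the zero case: with no limit the client simply waits for the delayed response, so pick zero only if you are prepared to block for as long as the server takes.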

  • No problem. The fact that you took the extra step of thanking me in a comment is more than enough. I'm glad I could help. PS: check my update. (2 upvotes)

Udi*_*ahi 12

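If you receive response code 404, you can skip that URL.

Use ignoreHttpErrors(true): jsoup then no longer throws HttpStatusException on an error status and parses whatever body the server sent back, which should get you past the exception.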

Document doc3 = null;
try {
    // url is assumed to be defined earlier, as in the question
    doc3 = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
            .referrer("http://www.google.com")
            .ignoreHttpErrors(true) // no HttpStatusException on 4xx/5xx responses
            .get();
} catch (IOException e) {
    // get() can still fail on network errors or timeouts
    e.printStackTrace();
}
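One caveat is worth spelling out: ignoring the error does not turn the 404 into the real page, it just hands you the error page instead of an exception. The sketch below emulates that behaviour with JDK classes only; it is not how jsoup is implemented, and IgnoreErrorsDemo, fetchBody and the always-404 local server are assumptions for illustration. In jsoup itself you could, I believe, get the same information by calling execute() and checking statusCode() on the returned Connection.Response before parse().

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class IgnoreErrorsDemo {

    static final String ERROR_PAGE = "<html><body>Page not found</body></html>";

    /** Fetches from a local server that always answers 404 with a small error page. */
    static String fetchBody(boolean ignoreHttpErrors) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                byte[] body = ERROR_PAGE.getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(404, body.length);
                exchange.getResponseBody().write(body);
                exchange.close();
            });
            server.start();
            try {
                URI uri = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/");
                HttpURLConnection conn =
                        (HttpURLConnection) uri.toURL().openConnection(Proxy.NO_PROXY);
                int status = conn.getResponseCode();
                if (status >= 400 && !ignoreHttpErrors) {
                    // Emulates the strict mode: an error status becomes an exception.
                    throw new IllegalStateException("HTTP error fetching URL. Status=" + status);
                }
                // For error statuses the body is only available on the error stream.
                try (InputStream in = status >= 400 ? conn.getErrorStream() : conn.getInputStream()) {
                    return new String(in.readAllBytes(), StandardCharsets.UTF_8);
                }
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Lenient mode returns the error page, not the content you were after.
        System.out.println(fetchBody(true));
    }
}
```

So if you go the ignoreHttpErrors route, check the status code before trusting the parsed document; otherwise you may end up scraping the site's "not found" page.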