maw*_*wus 6 html java connection http-status-code-404 jsoup
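I'm new to Jsoup, and I don't understand why I get a 404 error when I try to fetch a page, even though the page is reachable from a browser. I'm not using any proxies. I tried the following code: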
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url).get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
I get this exception message:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:449)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:424)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:178)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:167)
at ro.pago.ucl2015.UCLWebParser.connect(UCLWebParser.java:27)
at ro.pago.ucl2015.UCLWebParser.main(UCLWebParser.java:16)
Alk*_*ris 22
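It seems the site doesn't allow bots: it responds with a 404 error when the request doesn't carry a browser-like User-Agent header. The following works because it sets the User-Agent header: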
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com")
                .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
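User agent
The Hypertext Transfer Protocol (HTTP) uses the "User-Agent" header to identify the client software originating the request, even when the client is not operated by a user.
Referrer (I don't think this one is necessary)
The HTTP referer (originally a misspelling of referrer) is an HTTP header field that identifies the address of the web page (i.e. the URI or IRI) that linked to the resource being requested.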
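To see the effect of the User-Agent header without hitting the real site, here is a minimal sketch that uses only the JDK. It starts a local server with a hypothetical bot filter (404 unless the User-Agent starts with "Mozilla/", which is only my guess at how such a filter might behave, not the site's actual rule) and requests it twice through plain HttpURLConnection rather than Jsoup. The class and method names are made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class UserAgentDemo {

    // Hypothetical bot filter (an assumption, not the real site's logic):
    // answer 404 unless the User-Agent looks like a browser.
    static HttpServer startPickyServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean looksLikeBrowser = ua != null && ua.startsWith("Mozilla/");
            byte[] body = (looksLikeBrowser ? "ok" : "not found").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(looksLikeBrowser ? 200 : 404, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Returns the HTTP status of a GET request.
    // userAgent == null keeps the JDK default header ("Java/<version>").
    static int fetchStatus(String url, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        if (userAgent != null) {
            conn.setRequestProperty("User-Agent", userAgent);
        }
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    // Requests the local server once with the default User-Agent and once
    // with a browser-like one; returns both status codes in that order.
    static int[] runDemo() {
        try {
            HttpServer server = startPickyServer();
            try {
                String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
                int withDefaultUa = fetchStatus(url, null);
                int withBrowserUa = fetchStatus(url,
                        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0");
                return new int[] {withDefaultUa, withBrowserUa};
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] statuses = runDemo();
        System.out.println("default User-Agent: " + statuses[0]);
        System.out.println("browser User-Agent: " + statuses[1]);
    }
}
```

Against this toy server the first request gets a 404 and the second a 200, which matches the behaviour described above: same URL, different answer depending on the header.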
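Just to be thorough, I'd also recommend setting a timeout for your request. The default was 3 seconds when this was written (recent Jsoup releases document a 30-second default), and if the server takes longer than that you will get an exception. Below is your code with the timeout setter added. Set it to zero for no time limit (an infinite timeout).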
private static Document connect() {
    String url = "http://www.transfermarkt.co.uk/real-madrid/startseite/verein/418";
    Document doc = null;
    try {
        doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com")
                .timeout(1000 * 5) // it's in milliseconds, so this means 5 seconds
                .get();
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return doc;
}
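If you want to see what a too-short timeout looks like, the following rough sketch reproduces it locally with the JDK alone. It uses plain HttpURLConnection instead of Jsoup so that it runs without extra dependencies; the class name, method name, and the deliberately slow local server are all made up for illustration.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.SocketTimeoutException;
import java.net.URI;

class TimeoutDemo {

    // Starts a local server that waits serverDelayMillis before answering,
    // then requests it with the given client timeout.
    // Returns true if the client gave up with a SocketTimeoutException.
    static boolean timesOut(int serverDelayMillis, int clientTimeoutMillis) {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                try {
                    Thread.sleep(serverDelayMillis); // simulate a slow server
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                exchange.sendResponseHeaders(200, -1); // 200 with no body
                exchange.close();
            });
            server.start();
            try {
                String url = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
                HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
                conn.setConnectTimeout(clientTimeoutMillis); // milliseconds, 0 = no limit
                conn.setReadTimeout(clientTimeoutMillis);
                conn.getResponseCode(); // blocks until the response arrives or the timeout fires
                return false;
            } catch (SocketTimeoutException e) {
                return true;
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("slow server, 100 ms timeout: timed out = " + timesOut(500, 100));
        System.out.println("fast server, 2 s timeout:    timed out = " + timesOut(0, 2000));
    }
}
```

The first call times out because the server needs longer than the client is willing to wait; the second completes normally. With Jsoup, the value passed to timeout(...) plays the same role.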
Udi*_*ahi 12
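If you get a 404 response code, you can simply skip that URL.
Use ignoreHttpErrors(true): it stops Jsoup from throwing HttpStatusException on an error status, so a 404 no longer aborts your code.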
Document doc3 = null;
try {
    doc3 = Jsoup.connect(url)
            .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
            .referrer("http://www.google.com")
            .ignoreHttpErrors(true) // do not throw HttpStatusException on 4xx/5xx responses
            .get();
} catch (IOException e) {
    // get() declares IOException, so that is what must be caught here
    e.printStackTrace();
}