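I am building an application that fetches values from a specific website and prints them to the console. The values come from <span> elements, and I am using jsoup.
My problem is the following error:
HTTP error fetching URL
Here is my Java code: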
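import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TestSl {
    public static void main(String[] args) throws IOException {
        // Fetch and parse the page; this is the call that fails with HTTP 403
        Document doc = Jsoup.connect("https://stackoverflow.com/questions/11970938/java-html-parser-to-extract-specific-data").get();
        // Select every <span> whose class attribute is exactly "hidden-text"
        Elements spans = doc.select("span[class=hidden-text]");
        for (Element span : spans) {
            System.out.println(span.text());
        }
    }
}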
This is the error on the console:
Exception in thread "main" org.jsoup.HttpStatusException: HTTP error fetching URL. Status=403, URL=https://stackoverflow.com/questions/11970938/java-html-parser-to-extract-specific-data
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:590)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:540)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:227)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:216)
    at TestSl.main(TestSl.java:19)
What am I doing wrong, and how can I fix it?
Set the User-Agent header on the connection:
.userAgent("Mozilla")
Example:
Document document = Jsoup.connect("https://stackoverflow.com/questions/11970938/java-html-parser-to-extract-specific-data").userAgent("Mozilla").get();
Elements elements = document.select("span.hidden-text");
for (Element element : elements) {
System.out.println(element.text());
}
Output:
Stack Exchange
Inbox
Reputation and Badges
Source: https://stackoverflow.com/a/7523425/1048340
This may also be relevant: https://meta.stackexchange.com/questions/277369/a-terms-of-service-update-restricting-companies-that-scrape-your-profile-informa
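To make it clearer what the fix above changes at the HTTP level, here is a minimal sketch that sends the same kind of request with the JDK's built-in java.net.http client (Java 11+) instead of jsoup. It is not part of the original answer: the class name UserAgentRequest and the helper buildRequest are hypothetical, and whether the server accepts the plain "Mozilla" value is an assumption carried over from the answer.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical class name; used here for illustration only.
final class UserAgentRequest {

    // Builds a GET request that carries an explicit User-Agent header,
    // which is the header the jsoup userAgent(...) call is meant to set.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        HttpRequest request = buildRequest(
                "https://stackoverflow.com/questions/11970938/java-html-parser-to-extract-specific-data",
                "Mozilla"); // same value as in the answer; an assumption, not a guarantee
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // A 403 here would mean the server still refuses the request.
        System.out.println("Status: " + response.statusCode());
    }
}
```

Keeping the request construction in a separate method makes it possible to inspect the headers without touching the network, for example via request.headers().firstValue("User-Agent").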