Opening a connection with Jsoup, getting the status code, and parsing the document

Question

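I'm using jsoup to create a class that will do the following:

The constructor opens a connection to the URL.
A method checks the status of the page, i.e. 200, 404, etc.
A method parses the page and returns a list of URLs.

Below is a rough outline of what I'm trying to do. Note that it is very rough, as I have been trying a lot of different things.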

public class ParsePage {
private String path;
Connection.Response response = null;

private ParsePage(String langLocale){
    try {
        response = Jsoup.connect(path)
                .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
                .timeout(10000)
                .execute();
    } catch (IOException e) {
        System.out.println("io - "+e);
    }
}

public int getSitemapStatus(){
    int statusCode = response.statusCode();
    return statusCode;
}

public ArrayList<String> getUrls(){
    ArrayList<String> urls = new ArrayList<String>();

 }
}

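As you can see, I can get the page status, but I don't know how to parse the document using the connection already opened in the constructor. I tried using: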

Document doc = connection.get();

But that doesn't work. Any suggestions? Or is there a better way to go about this?

Answer 1

Ale*_*lex 15

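As stated in the JSoup documentation for the Connection.Response type, there is a parse() method that parses the response body as a Document and returns it. Once you have it, you can do whatever you want with it.

For example, see the implementation of getUrls():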

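import java.io.IOException;
import java.util.ArrayList;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ParsePage {
   private final String path;
   private Connection.Response response = null;

   // Executes the request once; the stored response is reused by the methods below.
   public ParsePage(String path) {
      this.path = path;
      try {
         response = Jsoup.connect(path)
            .userAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.21 (KHTML, like Gecko) Chrome/19.0.1042.0 Safari/535.21")
            .timeout(10000)
            .execute();
      } catch (IOException e) {
         System.out.println("io - " + e);
      }
   }

   public int getSitemapStatus() {
      return response.statusCode();
   }

   // parse() can throw IOException, so it is declared here.
   public ArrayList<String> getUrls() throws IOException {
      ArrayList<String> urls = new ArrayList<String>();
      // Parse the body of the response that was already fetched in the constructor
      Document doc = response.parse();
      // do whatever you want, for example retrieving the <url> from the sitemap
      for (Element url : doc.select("url")) {
         urls.add(url.select("loc").text());
      }
      return urls;
   }
}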
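
The `url`/`loc` selection used in getUrls() can be tried without a network connection. The following is a minimal sketch, assuming a hand-written sitemap string; the class name `SitemapSketch`, the method `extractLocs`, and the example.com URLs are made up for illustration. It parses with jsoup's XML parser, which is one reasonable choice for sitemap content.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

// Hypothetical helper, not part of the original answer.
class SitemapSketch {
    // Collects the text of every <loc> found inside a <url> element.
    static List<String> extractLocs(String sitemapXml) {
        // The empty string is the base URI; it is not needed for reading text.
        Document doc = Jsoup.parse(sitemapXml, "", Parser.xmlParser());
        List<String> urls = new ArrayList<>();
        for (Element url : doc.select("url")) {
            urls.add(url.select("loc").text());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for a downloaded sitemap.
        String xml = "<urlset>"
                + "<url><loc>https://example.com/a</loc></url>"
                + "<url><loc>https://example.com/b</loc></url>"
                + "</urlset>";
        System.out.println(extractLocs(xml));
    }
}
```

With a live response, the same loop runs on the Document returned by response.parse().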

Answer 2

Igo*_*tos 6

If you don't need to log in, use:

Document doc = Jsoup.connect("url").get();

If you do need to log in, I suggest using:

Connection.Response res = Jsoup.connect("url")
    .data("loginField", "yourUser", "passwordField", "yourPassword")
    .method(Connection.Method.POST)
    .execute();
Document doc = res.parse();

// If you need to stay logged in to the page, use
Map<String, String> cookies = res.cookies();

// And for every subsequent connection, you'll need to use
Document pageWhenAlreadyLoggedIn = Jsoup.connect("url").cookies(cookies).get();

For getting the URLs in your case, I would probably try:

Elements elems = doc.select("a[href]");
for (Element elem : elems) {
    String link = elem.attr("href");
}
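
The href values read this way are returned exactly as written in the page, so relative links stay relative. As a minimal sketch, assuming an in-memory HTML string and a base URI of https://example.com/ (the class `LinkSketch` and method `absoluteLinks` are hypothetical names), jsoup's absUrl("href") can resolve them against the base URI:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of the original answer.
class LinkSketch {
    // Returns every link in the page as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        // The base URI is what relative hrefs are resolved against.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> links = new ArrayList<>();
        // a[href] skips anchors that have no href attribute.
        for (Element a : doc.select("a[href]")) {
            links.add(a.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample markup for illustration.
        String html = "<a href=\"/about\">About</a>"
                + "<a href=\"https://other.example/x\">X</a>"
                + "<a>no href</a>";
        System.out.println(absoluteLinks(html, "https://example.com/"));
    }
}
```

When the Document comes from Jsoup.connect(...).get() or response.parse(), the base URI is already set from the request URL, so absUrl works without passing one explicitly.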

That's it. Keep up the good work.

Answer 3

小智 5

You should be able to call parse() on the response object.

Document doc = response.parse();
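
One thing to keep in mind when combining the status check with parse(): by default, execute() throws an HttpStatusException for error statuses such as 404, so the status code never reaches the caller. A minimal sketch of one way to handle this, assuming it is acceptable to return null for non-200 pages (the class `StatusThenParse` and its method names are hypothetical):

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper, not part of the original answer.
class StatusThenParse {
    // Builds the request without sending it. ignoreHttpErrors(true) makes
    // execute() return 4xx/5xx responses instead of throwing.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .timeout(10000)
                .ignoreHttpErrors(true);
    }

    // Executes the request once, checks the status, and parses only on 200.
    // Returns null for any other status (an assumption made for this sketch).
    static Document fetchIfOk(String url) throws IOException {
        Connection.Response response = prepare(url).execute();
        if (response.statusCode() != 200) {
            return null;
        }
        // Reuses the body that execute() already fetched; no second request.
        return response.parse();
    }
}
```

Network failures such as timeouts still surface as IOException, so callers need to handle that separately from the status check.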

Archived:	13 years, 8 months ago
Views:	26392
Last activity:	8 years, 8 months ago