Page content is loaded with JavaScript and Jsoup doesn't see it

Question

Eug*_*ene 28 html javascript java parsing jsoup

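One block on the page is filled with content by JavaScript, and after loading the page with Jsoup none of that content is there. Is there a way to get the JavaScript-generated content when parsing the page with Jsoup?

I can't paste the page code here because it is too long: http://pastebin.com/qw4Rfqgw

This is the element whose content I need: <div id='tags_list'></div>

I need to get this information in Java, preferably using Jsoup. The element is filled in with the help of JavaScript: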

<div id="tags_list">
    <a href="/tagsc0t20099.html" style="font-size:14;">?????????</a>
    <a href="/tagsc0t1879.html" style="font-size:14;">Sr</a>
    <a href="/tagsc0t3140.html" style="font-size:14;">??????????????</a>
</div>


Java code:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

public class Test
{
    public static void main( String[] args )
    {
        try
        {
            // Fetches and parses only the HTML the server sends; scripts are not executed
            Document doc = Jsoup.connect( "http://www.bestreferat.ru/referat-32558.html" ).get();
            Elements tags = doc.select( "#tags_list a" );

            for ( Element tag : tags )
            {
                System.out.println( tag.text() );
            }
        }
        catch ( IOException e )
        {
            e.printStackTrace();
        }
    }
}


Answer 1

fvu*_*fvu 23

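Jsoup is an HTML parser, not some kind of embedded browser engine. That means it is completely unaware of any content that JavaScript adds to the DOM after the initial page load.

To get access to that kind of content you need an embedded browser component. There are many discussions on SO about such components, for example: Is there a way to embed a browser in Java?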
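
To make the point concrete, here is a minimal sketch. The HTML string is made up for illustration (it only mimics the question's `tags_list` div), and the class and method names are hypothetical. It shows that Jsoup keeps a script as inert text and never runs it:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {
    // Made-up page: in a browser, the script would fill #tags_list.
    static final String HTML =
            "<html><body>"
          + "<div id='tags_list'></div>"
          + "<script>"
          + "document.getElementById('tags_list').innerHTML = '<a href=\"/tagsc0t1879.html\">Sr</a>';"
          + "</script>"
          + "</body></html>";

    /** Number of links Jsoup finds inside #tags_list. */
    static int tagLinkCount(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#tags_list a").size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(HTML);
        // The script is never run, so the div stays empty.
        System.out.println("links in #tags_list: " + tagLinkCount(HTML));
        // The script source is still available, but only as text.
        System.out.println("script source: " + doc.select("script").first().data());
    }
}
```

The link count printed here is 0, which is the same symptom the question describes.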

Is there any other library that can be used on Android to get page content that is loaded by JavaScript? (2 upvotes)

Answer 2

ilu*_*hin 14

I solved my case with com.codeborne.phantomjsdriver. Note: this is Groovy code.

pom.xml

        <dependency>
          <groupId>com.codeborne</groupId>
          <artifactId>phantomjsdriver</artifactId>
          <version> <here goes last version> </version>
        </dependency>


PhantomJsUtils.groovy

import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.WebDriver
import org.openqa.selenium.phantomjs.PhantomJSDriver

class PhantomJsUtils {
    // Directory for temporary HTML files; it must already exist
    private static String filePath = 'data/temp/';

    // Loads the given file or URL in PhantomJS, lets its scripts run,
    // and parses the resulting page source with Jsoup
    public static Document renderPage(String filePath) {
        System.setProperty("phantomjs.binary.path", 'libs/phantomjs') // path to bin file. NOTE: platform dependent
        WebDriver ghostDriver = new PhantomJSDriver();
        try {
            ghostDriver.get(filePath);
            return Jsoup.parse(ghostDriver.getPageSource());
        } finally {
            ghostDriver.quit();
        }
    }

    // Writes an already fetched document to a temporary file and renders that file
    public static Document renderPage(Document doc) {
        String tmpFileName = "$filePath${Calendar.getInstance().timeInMillis}.html";
        new File(tmpFileName).text = doc.toString();
        return renderPage(tmpFileName);
    }
}


ClassInProject.groovy

Document doc = PhantomJsUtils.renderPage(Jsoup.parse(yourSource))
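
Once the rendered source is available, the question's selector can be applied to it as usual. The following is a rough Java sketch of that last step only. The class and method names are hypothetical, and the HTML string stands in for whatever the driver's `getPageSource()` would return. Passing the page URL as the base URI is an addition not present in the answer; it lets Jsoup resolve the relative `href` values.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.LinkedHashMap;
import java.util.Map;

public class RenderedTags {
    /** Maps each tag's text to its absolute URL, given already rendered HTML. */
    static Map<String, String> tagLinks(String renderedHtml, String pageUrl) {
        // The base URI is what makes absUrl("href") work for relative links.
        Document doc = Jsoup.parse(renderedHtml, pageUrl);
        Map<String, String> links = new LinkedHashMap<>();
        for (Element a : doc.select("#tags_list a")) {
            links.put(a.text(), a.absUrl("href"));
        }
        return links;
    }

    public static void main(String[] args) {
        // Stand-in for the rendered page source; markup taken from the question.
        String rendered = "<div id=\"tags_list\">"
                + "<a href=\"/tagsc0t1879.html\" style=\"font-size:14;\">Sr</a>"
                + "</div>";
        tagLinks(rendered, "http://www.bestreferat.ru/referat-32558.html")
                .forEach((text, url) -> System.out.println(text + " -> " + url));
    }
}
```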


Answer 3

Vic*_*yew 7

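You need to understand what is happening:

When you request a page from a website, whether with Jsoup or with a browser, some HTML is sent back to you. Jsoup can parse that.
However, most websites include JavaScript in that HTML, or link to JavaScript from it, and that JavaScript fills the page with content. Your browser can execute JavaScript and so populate the page. Jsoup cannot.

The way to understand this is as follows: parsing HTML is easy. Executing JavaScript and updating the HTML accordingly is far more complicated, and that is a browser's job.

Here are some solutions to this kind of problem: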

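If you can find the Ajax call that the JavaScript makes to load the content, you may be able to use the URL of that call with Jsoup. Use the developer tools in your browser to find it. This is not guaranteed to work, though:
- the URL may be dynamic and depend on what is on the page at that time
- if the content is not public, cookies will be involved, and simply querying the resource URL will not be enough
In those cases you will need to "simulate" the work of a browser. Fortunately, such tools exist. The one I know and recommend is PhantomJS. It is scripted with JavaScript, and you would need to launch it from Java by starting a new process. If you want to stick to Java, this post lists some Java alternatives.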
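
The first option, calling the Ajax URL directly, might look roughly like the sketch below. `AJAX_URL` is a hypothetical placeholder, not the site's real endpoint; the actual request URL has to be taken from the browser's network tab. The class and method names are also made up. The sketch loads the page first so that any cookies are issued, then repeats the Ajax request with those cookies and a referrer.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class AjaxTags {
    static final String PAGE_URL = "http://www.bestreferat.ru/referat-32558.html";
    // Hypothetical endpoint: replace with the URL seen in the developer tools.
    static final String AJAX_URL = "http://www.example.com/ajax/tags?id=32558";

    /** Extracts link texts from an HTML fragment such as an Ajax response body. */
    static List<String> linkTexts(String fragmentHtml) {
        Document doc = Jsoup.parseBodyFragment(fragmentHtml);
        List<String> texts = new ArrayList<>();
        for (Element a : doc.select("a[href]")) {
            texts.add(a.text());
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        // Load the page first so that any session cookies are issued.
        Connection.Response page = Jsoup.connect(PAGE_URL)
                .method(Connection.Method.GET)
                .execute();

        // Repeat the Ajax call with the same cookies and a referrer.
        Connection.Response ajax = Jsoup.connect(AJAX_URL)
                .cookies(page.cookies())
                .referrer(PAGE_URL)
                .ignoreContentType(true) // the response may not be HTML
                .execute();

        linkTexts(ajax.body()).forEach(System.out::println);
    }
}
```

If the endpoint returns JSON rather than an HTML fragment, the body would need a JSON parser instead of `linkTexts`.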

Answer 4

Hit*_*eeb 7

You can use a combination of Jsoup and HtmlUnit to get the page content after the JavaScript has finished loading it.

pom.xml

<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.35.0</version>
</dependency>


A simple example that loads the page from a file, taken from https://riptutorial.com/jsoup/example/16274/parsing-javascript-generated-page-with-jsoup-and-htmunit

import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// load page using HtmlUnit and fire scripts
WebClient webClient2 = new WebClient();
HtmlPage myPage = webClient2.getPage(new File("page.html").toURI().toURL());

// convert page to generated HTML and convert to document
Document doc = Jsoup.parse(myPage.asXml());

// iterate row and col
for (Element row : doc.select("table#data > tbody > tr"))
    for (Element col : row.select("td"))
        // print results
        System.out.println(col.ownText());

// clean up resources
webClient2.close();

A more complex example: load the login page, get the session and CSRF cookies, then POST the login and wait up to 15 seconds for the home page to finish loading.

import java.io.IOException;
import java.net.HttpCookie;
import java.net.MalformedURLException;
import java.net.URL;

import org.jsoup.Connection;
import org.jsoup.Connection.Method;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.HttpMethod;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Jsoup: load the login page and get the session details
Connection.Response res = Jsoup.connect("https://loginpage").method(Method.GET).execute();

String sessionId = res.cookie("findSESSION");
String csrf = res.cookie("findCSRF");

HttpCookie cookie = new HttpCookie("findCSRF", csrf);
cookie.setDomain("domain.url");
cookie.setPath("/path");

WebClient webClient = new WebClient();
webClient.addCookie(cookie.toString(), new URL("https://url"), "https://referrer");

// Add other cookies/ Session ...

webClient.getOptions().setJavaScriptEnabled(true);
webClient.getOptions().setCssEnabled(false);
webClient.getOptions().setUseInsecureSSL(true);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getCookieManager().setCookiesEnabled(true);
webClient.setAjaxController(new NicelyResynchronizingAjaxController());

// POST the login form
URL url = new URL("https://login.path");
WebRequest requestSettings = new WebRequest(url, HttpMethod.POST);
requestSettings.setRequestBody("user=234&pass=sdsdc&CSRFToken=" + csrf);

HtmlPage page = webClient.getPage(requestSettings);

// Wait up to 15 seconds for background JavaScript started by the page to finish
webClient.waitForBackgroundJavaScript(15000);

// Parse logged in page as needed
Document doc = Jsoup.parse(page.asXml());

Asked:	14 years, 4 months ago
Viewed:	42009 times
Last active:	7 years, 1 month ago