How do I use HtmlUnit to extract text without HTML tags from a web page?

Question

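I'm just getting started with HtmlUnit, and what I'm looking for is a way to fetch a web page and extract its raw text, minus all the HTML markup.

Can this be done with HtmlUnit? If so, how? Or should I look at another library?

For example, if the page contains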

<body><p>para1 test info</p><div><p>more stuff here</p></div>

I would like the output to be

para1 test info more stuff here

Thanks

Answer 1

Syn*_*tax 5

http://htmlunit.sourceforge.net/gettingStarted.html indicates that this is indeed possible.

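import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.junit.Test;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HomePageTest {  // class name is illustrative
    @Test
    public void homePage() throws Exception {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage("http://htmlunit.sourceforge.net");
        assertEquals("HtmlUnit - Welcome to HtmlUnit", page.getTitleText());

        // asXml() returns the page with its markup intact.
        final String pageAsXml = page.asXml();
        assertTrue(pageAsXml.contains("<body class=\"composite\">"));

        // asText() returns the page's text content without the HTML tags.
        final String pageAsText = page.asText();
        assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
    }
}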

Note: the page.asText() method appears to provide what you are after.

Javadoc for asText (inherited by HtmlPage from DomNode)
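
For the sample markup in the question, asText() will probably return the two paragraphs on separate lines rather than as the single line shown in the desired output. This is an assumption about its whitespace handling, which may differ between HtmlUnit versions. A minimal sketch of flattening the extracted text follows; TextFlattener and flatten are hypothetical names, not part of HtmlUnit, and the code uses only the Java standard library.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HtmlUnit): flattens extracted page text to one line.
final class TextFlattener {
    // Matches one or more whitespace characters, including line breaks and tabs.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private TextFlattener() {}

    // Replaces every run of whitespace with a single space and trims both ends.
    static String flatten(String text) {
        if (text == null) {
            return "";
        }
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by page.asText(); the line break is assumed.
        String extracted = "para1 test info\nmore stuff here\n";
        System.out.println(flatten(extracted));
    }
}
```

Inside the test above, the helper would be applied as TextFlattener.flatten(page.asText()).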
