Extract the innerHtml from the body tag using jsoup

sha*_*ham 3 html java jsoup

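I am parsing HTML with jsoup and want to extract the innerHtml of the body tag.

So far I have tried document.body().children().outerHtml(), but it returns only the HTML elements and skips the floating text inside body (text that is not wrapped in any HTML tag).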

private String getBodyTag(final Document document) {
        return document.body().children().outerHtml();
}

Input:

<!DOCTYPE html>
<html lang="de">
    <head>
        <META http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <link rel="stylesheet" type="text/css" href="assets/style.css">
    </head>
    <body>
       <div>questions to improve formatting and clarity.</div>
       <h3>Guided Mode</h3> 
       some sample raw/floating text
    </body>
</html>

Expected:

<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3> 
some sample raw/floating text

Actual:

<div>questions to improve formatting and clarity.</div>
<h3>Guided Mode</h3>

小智 5

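Use this instead. children() returns only the child elements, so text nodes sitting directly under body are dropped, whereas html() returns the element's full inner HTML, text nodes included: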

private String getBodyTag(final Document document) {
    return document.body().html();
}
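
To see the difference side by side, here is a minimal sketch that runs both calls on a trimmed-down version of the input from the question. The class name BodyInnerHtmlDemo and the helper names elementsOnly and innerHtml are made up for illustration, and it assumes jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical demo class; not part of jsoup.
class BodyInnerHtmlDemo {

    // Child elements only: text nodes directly under <body> are skipped.
    static String elementsOnly(Document document) {
        return document.body().children().outerHtml();
    }

    // Full inner HTML of <body>, including loose text nodes.
    static String innerHtml(Document document) {
        return document.body().html();
    }

    public static void main(String[] args) {
        // Shortened version of the question's input document.
        String html = "<!DOCTYPE html><html lang=\"de\"><head></head><body>"
                + "<div>questions to improve formatting and clarity.</div>"
                + "<h3>Guided Mode</h3> some sample raw/floating text"
                + "</body></html>";
        Document document = Jsoup.parse(html);

        System.out.println("children().outerHtml():");
        System.out.println(elementsOnly(document));

        System.out.println("html():");
        System.out.println(innerHtml(document));
    }
}
```

The first result should contain only the div and h3, while the second should also include the floating text. jsoup pretty-prints by default, so the whitespace and line breaks in the output may not match the input exactly; if that matters, calling document.outputSettings().prettyPrint(false) before html() is one option to try.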