Jsoup - extracting text

Eug*_*sky 8 java iteration text-extraction jsoup

I need to extract text from a node like this:

<div>
    Some text <b>with tags</b> might go here.
    <p>Also there are paragraphs</p>
    More text can go without paragraphs<br/>
</div>

I need to build:

Some text <b>with tags</b> might go here.
Also there are paragraphs
More text can go without paragraphs

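Element.text returns all of the div's content as one string. Element.ownText returns everything that is not inside a child element. Both are wrong for my purposes. Iterating through children ignores text nodes.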
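To make the difference between these three accessors concrete, here is a minimal sketch run against the div from the question. The class name TextAccessorsDemo and the helper parseDiv are made up for illustration; the jsoup calls (Jsoup.parse, select, text, ownText, children) are the ones discussed above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TextAccessorsDemo {
    // The sample markup from the question
    static final String HTML = "<div>"
            + "    Some text <b>with tags</b> might go here."
            + "    <p>Also there are paragraphs</p>"
            + "    More text can go without paragraphs<br/>"
            + "</div>";

    // Hypothetical helper: parse the sample and return its first <div>
    static Element parseDiv() {
        Document doc = Jsoup.parse(HTML);
        return doc.select("div").first();
    }

    public static void main(String[] args) {
        Element div = parseDiv();
        // text(): combined text of the div and all of its descendants
        System.out.println("text():     " + div.text());
        // ownText(): only the text directly inside the div, skipping <b> and <p>
        System.out.println("ownText():  " + div.ownText());
        // children(): child elements only (<b>, <p>, <br>), no text nodes
        System.out.println("children(): " + div.children().size());
    }
}
```

None of the three keeps the text and the elements in document order, which is what the desired output needs.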

Is there a way to iterate over an element's contents so that text nodes are returned as well? For example:

  • Text node - Some text
  • Node <b> - with tags
  • Text node - might go here.
  • Node <p> - Also there are paragraphs
  • Text node - More text can go without paragraphs
  • Node <br> - <empty>

Vad*_*rev 12

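Element.children() returns an Elements object - a list of Element objects. If you look at the parent class, Node, you'll see methods that give you access to arbitrary nodes, not just Elements, such as Node.childNodes().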

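import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ChildNodesDemo {
    public static void main(String[] args) {
        String str = "<div>" +
                "    Some text <b>with tags</b> might go here." +
                "    <p>Also there are paragraphs</p>" +
                "    More text can go without paragraphs<br/>" +
                "</div>";

        Document doc = Jsoup.parse(str);
        Element div = doc.select("div").first();
        int i = 0;

        // childNodes() returns every direct child, text nodes included
        for (Node node : div.childNodes()) {
            i++;
            // Print the index, the node's class (TextNode or Element) and its HTML
            System.out.println(String.format("%d %s %s",
                    i,
                    node.getClass().getSimpleName(),
                    node.toString()));
        }
    }
}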

Result:

1 TextNode 
 Some text 
2 Element <b>with tags</b>
3 TextNode  might go here. 
4 Element <p>Also there are paragraphs</p>
5 TextNode  More text can go without paragraphs
6 Element <br/>
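
Building on childNodes(), the lines asked for in the question could be assembled roughly as follows. This is only a sketch under a few assumptions: the class TextLineExtractor and its methods extractLines and flush are hypothetical names, inline elements such as <b> are kept as markup, and a <br> or a block-level element such as <p> ends the current line. Pretty-printing is switched off so that outerHtml() does not add indentation.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class TextLineExtractor {

    // Hypothetical helper: splits the first <div> of the input into lines.
    // Assumes the input contains at least one <div>.
    static List<String> extractLines(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings().prettyPrint(false);
        Element div = doc.select("div").first();

        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Node node : div.childNodes()) {
            if (node instanceof TextNode) {
                // text() collapses runs of whitespace into single spaces
                current.append(((TextNode) node).text());
            } else if (node instanceof Element) {
                Element el = (Element) node;
                if (el.tagName().equals("br")) {
                    // <br> ends the current line
                    flush(current, lines);
                } else if (el.isBlock()) {
                    // a block element such as <p> becomes a line of its own
                    flush(current, lines);
                    current.append(el.text());
                    flush(current, lines);
                } else {
                    // inline elements such as <b> stay in the line as markup
                    current.append(el.outerHtml());
                }
            }
            // other node types (comments and so on) are skipped
        }
        flush(current, lines);
        return lines;
    }

    // Adds the trimmed buffer to the list if it is not empty, then clears it
    private static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().trim();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    public static void main(String[] args) {
        String str = "<div>"
                + "    Some text <b>with tags</b> might go here."
                + "    <p>Also there are paragraphs</p>"
                + "    More text can go without paragraphs<br/>"
                + "</div>";
        for (String line : extractLines(str)) {
            System.out.println(line);
        }
    }
}
```

For the div from the question this should print the three lines the question asks for. Nested markup inside a <p> is flattened to plain text here; if it needs to be preserved, el.html() could be used instead of el.text().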