Getting untagged text from a web page with Jsoup

Question

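I need to use Jsoup to extract some data from a web page.

Extracting the data contained in tags is easy, but I also need some text that is not wrapped in a tag of its own.

Here is a sample of the HTML source: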

<a id="aId" href="aLink" style="aStyle">
    <span id="spanId1">
        <b>Caldan Therapeutics</b> 
        Announces Key Appointments And A Collaboration With 
        <b>Sygnature Discovery</b>  
    </span>
    <span id="spanId2" style="spanStyle2">
        5/17/2016
    </span>
</a>

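I have already extracted the data inside the <b> tags as well as the date. What I want now is to extract the sentence "Announces Key Appointments And A Collaboration With".

As you can see, this sentence sits directly inside the first <span> and has no tag of its own.

How can I extract it?

I have done my research, and all I could find was how to strip all the tags.

Thanks for your help!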

Answer 1

use*_*868 7

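I found a way to meet this specific need, and I want to share it with anyone who may face the same problem in the future.

You can use the ownText() method, which returns only the text that belongs directly to an element and leaves out the text of its child tags.

In our case: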

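import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class OwnTextExample {
    public static void main(String[] args) throws Exception {
        // "http://source-url" is a placeholder for the page being scraped
        Document doc = Jsoup.connect("http://source-url").get();
        Elements spanTags = doc.getElementsByTag("span");
        for (Element spanTag : spanTags) {
            // ownText() skips the text of child elements such as <b>
            String text = spanTag.ownText();
            System.out.println(text);
        }
    }
}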

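To try this without a network connection, here is a minimal sketch that parses the HTML fragment from the question with Jsoup.parse and reads ownText() from each span. The class name OwnTextDemo, the constant SAMPLE_HTML, and the helper ownTextOf are made up for illustration; only Jsoup.parse, getElementById, ownText, and text are Jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextDemo {
    // The HTML fragment from the question, inlined so no network access is needed.
    static final String SAMPLE_HTML =
          "<a id=\"aId\" href=\"aLink\" style=\"aStyle\">\n"
        + "    <span id=\"spanId1\">\n"
        + "        <b>Caldan Therapeutics</b> \n"
        + "        Announces Key Appointments And A Collaboration With \n"
        + "        <b>Sygnature Discovery</b>  \n"
        + "    </span>\n"
        + "    <span id=\"spanId2\" style=\"spanStyle2\">\n"
        + "        5/17/2016\n"
        + "    </span>\n"
        + "</a>";

    // Hypothetical helper: returns the text directly owned by the element
    // with the given id, or null if no such element exists.
    static String ownTextOf(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element el = doc.getElementById(id);
        return el == null ? null : el.ownText();
    }

    public static void main(String[] args) {
        // Text directly inside each span, without the <b> children.
        System.out.println(ownTextOf(SAMPLE_HTML, "spanId1"));
        System.out.println(ownTextOf(SAMPLE_HTML, "spanId2"));

        // For comparison, text() also includes the text of the <b> children.
        Element span = Jsoup.parse(SAMPLE_HTML).getElementById("spanId1");
        System.out.println(span.text());
    }
}
```

For the first span, ownText() should give just the untagged sentence, while text() gives the whole headline including both company names. Selecting by id rather than looping over every span keeps the sentence and the date separate.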

Archived:	9 years, 8 months ago
Views:	2509
Last activity:	8 years ago