Jsoup CSS selector code (XPath code included)

Question


PTS*_*min 4 xpath tag-soup css-selectors html-parsing jsoup

I am trying to parse the HTML below with jsoup, but I cannot work out the right syntax.

<div class="info"><strong>Line 1:</strong> some text 1<br>
  <b>some text 2</b><br>
  <strong>Line 3:</strong> some text 3<br>
</div>


I need to capture some text 1, some text 2, and some text 3 in three separate variables.

I have the XPath for the first line (line 3 should be similar), but I cannot work out the equivalent CSS selector.

//div[@class='info']/strong[1]/following::text()


Please help.

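On a separate note, I have a few hundred HTML files that need to be parsed, with the extracted data stored in a database. Is jsoup the best choice for this?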

I am trying to reopen this question because I still have not found a solution. Please help.

Answer 1

laz*_*laz 5

It looks like jsoup cannot handle the text of elements with mixed content. Here is a solution that uses the XPath you formulated, with XOM and TagSoup:

import java.io.IOException;

import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Nodes;
import nu.xom.ParsingException;
import nu.xom.ValidityException;
import nu.xom.XPathContext;

import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.SAXException;

public class HtmlTest {
    public static void main(final String[] args) throws SAXException, ValidityException, ParsingException, IOException {
        final String html = "<div class=\"info\"><strong>Line 1:</strong> some text 1<br><b>some text 2</b><br><strong>Line 3:</strong> some text 3<br></div>";
        final Parser parser = new Parser();
        final Builder builder = new Builder(parser);
        final Document document = builder.build(html, null);
        final nu.xom.Element root = document.getRootElement();
        final Nodes textElements = root.query("//xhtml:div[@class='info']/xhtml:strong[1]/following::text()", new XPathContext("xhtml", root.getNamespaceURI()));
        for (int textNumber = 0; textNumber < textElements.size(); ++textNumber) {
            System.out.println(textElements.get(textNumber).toXML());
        }
    }
}


This outputs:

 some text 1
some text 2
Line 3:
 some text 3


Without knowing more specifics about what you are trying to do, though, I am not sure this is exactly what you want.
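
If the goal is exactly the three values from the question, one option is to narrow the XPath so that each expression picks a single node: `following-sibling::text()[1]` selects only the first text node after a given `<strong>`, rather than every text node after it. The sketch below is only an illustration under some assumptions: it uses the JDK's built-in `javax.xml.xpath` rather than XOM, it assumes the markup has already been made well-formed (`<br/>` instead of `<br>`, which a tidying parser such as TagSoup would take care of), and the class name `InfoTextExtractor`, the `XHTML` constant, and the `extract` method are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class InfoTextExtractor {
    // Assumption: a well-formed variant of the question's markup (<br/> instead of <br>).
    public static final String XHTML =
            "<div class=\"info\"><strong>Line 1:</strong> some text 1<br/>"
            + "<b>some text 2</b><br/>"
            + "<strong>Line 3:</strong> some text 3<br/></div>";

    // Returns the three values, trimmed, in document order.
    public static String[] extract(final String xhtml) throws Exception {
        final Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        final XPath xpath = XPathFactory.newInstance().newXPath();
        final String base = "//div[@class='info']";

        // First text node that follows the first <strong> as a sibling.
        final String first = xpath.evaluate(base + "/strong[1]/following-sibling::text()[1]", doc);
        // The second value is the text content of the <b> element itself.
        final String second = xpath.evaluate(base + "/b[1]", doc);
        // First text node that follows the second <strong> as a sibling.
        final String third = xpath.evaluate(base + "/strong[2]/following-sibling::text()[1]", doc);

        return new String[] { first.trim(), second.trim(), third.trim() };
    }

    public static void main(final String[] args) throws Exception {
        for (final String value : extract(XHTML)) {
            System.out.println(value);
        }
    }
}
```

For the sample markup this prints `some text 1`, `some text 2` and `some text 3` on separate lines. The same three expressions should also work with the XOM and TagSoup setup above, provided the `xhtml:` prefix is added to each element name as in the original query.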
