JSoup - Preserve HTML entities when output is UTF-8?

Question

I want to preserve HTML entities when using JSoup. Here is a UTF-8 test string from a website:

String html = "<html><body>hello &#151; world</body></html>";

String parsed = Jsoup.parse(html).toString();


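If I print the parsed output as UTF-8, the sequence appears to be converted into the character with code point value 151.

Is there a way to make JSoup preserve the original entities when the output is UTF-8? If I output with ASCII encoding: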

Document.OutputSettings settings = new Document.OutputSettings();
settings.charset(Charset.forName("ascii"));
Jsoup.parse(html).outputSettings(settings).toString();


I get:

hello &#x97; world


Which is what I'm looking for.

Answer 1

Ste*_*han 2

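You have found a missing feature in Jsoup (Jsoup 1.8.3 at the time of writing).

I can see three options:

Option 1

Send a feature request at https://github.com/jhy/jsoup. I'm not sure it will be added soon, though...

Option 2

Use the workaround provided in this answer: https://stackoverflow.com/a/34493022/363573

Option 3

Write a custom NodeVisitor that converts characters with specific code point values back into their equivalent HTML escape sequences.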
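
The escaping step that such a visitor would need could look roughly like the sketch below. It uses only the Java standard library and does not show how to wire it into Jsoup's NodeVisitor. The names EntityEscaper, escape and C1_CONTROLS are hypothetical, and the choice to escape only the C1 control range (U+0080 to U+009F, which includes code point 151) is an assumption made for illustration.

```java
import java.util.function.IntPredicate;

public class EntityEscaper {

    // Assumption: only C1 control characters (U+0080..U+009F) should be
    // turned back into numeric character references.
    public static final IntPredicate C1_CONTROLS = cp -> cp >= 0x80 && cp <= 0x9F;

    // Returns a copy of text in which every code point accepted by
    // shouldEscape is replaced by a hexadecimal reference such as &#x97;.
    // All other code points are copied unchanged.
    public static String escape(String text, IntPredicate shouldEscape) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (shouldEscape.test(cp)) {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }
}
```

A visitor could call EntityEscaper.escape on the text of each text node, or the method could be applied to the serialized UTF-8 output as a whole. Either way, other non-ASCII characters stay as literal UTF-8 because the predicate rejects them.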
