jsoup escapes the ampersand in a link href

Question

jsoup escapes the ampersand in the query part of a URL in a link href. Given the following example:

    String l_input = "<html><body>before <a href=\"http://a.b.com/ct.html\">link text</a> after</body></html>";
    org.jsoup.nodes.Document l_doc = org.jsoup.Jsoup.parse(l_input);
    org.jsoup.select.Elements l_html_links = l_doc.getElementsByTag("a");
    for (org.jsoup.nodes.Element l : l_html_links) {
      l.attr("href", "http://a.b.com/ct.html?a=111&b=222");
    }
    String l_output = l_doc.outerHtml();

the output is:

    <html>
    <head></head>
    <body>
    before 
    <a href="http://a.b.com/ct.html?a=111&amp;b=222">link text</a> after
    </body>
    </html>

The single & is escaped to &amp;. Shouldn't it stay as &?

Answer 1

d0x*_*d0x 5

It seems you can't do that. I went through the source and found the place where the escaping happens.

It is defined in Attribute.java:

    /**
     Get the HTML representation of this attribute; e.g. {@code href="index.html"}.
     @return HTML
     */
    public String html() {
        return key + "=\"" + Entities.escape(value, (new Document("")).outputSettings()) + "\"";
    }

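Here you can see that it uses Entities.java, and that jsoup takes the default output settings of a freshly created new Document(""). Because the settings come from that throwaway document and not from your own, you cannot override them.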
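To make the effect concrete, here is a minimal sketch of what serializing an attribute with escaping amounts to. It is a simplified stand-in and not jsoup's actual code: the class `AttrSketch` and its methods `escape` and `html` are hypothetical names, and only `&` and `"` are handled.

```java
public class AttrSketch {
    // Hypothetical, simplified stand-in for Entities.escape:
    // it only rewrites '&' and '"', nothing else.
    static String escape(String value) {
        StringBuilder sb = new StringBuilder();
        for (char c : value.toCharArray()) {
            if (c == '&') {
                sb.append("&amp;");
            } else if (c == '"') {
                sb.append("&quot;");
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Mirrors the shape of Attribute.html(): key="escaped value".
    static String html(String key, String value) {
        return key + "=\"" + escape(value) + "\"";
    }

    public static void main(String[] args) {
        // The raw value holds a plain '&'; only the serialized form is escaped.
        System.out.println(html("href", "http://a.b.com/ct.html?a=111&b=222"));
    }
}
```

The point of the sketch is that the escaping is applied at serialization time, so the value stored on the attribute is unchanged and only the generated markup contains `&amp;`.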

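Maybe you should file a feature request for this.

By the way: the default escape mode is set to base.

Document.java creates a default OutputSettings object, and the default is defined there. See: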

    /**
     * A HTML Document.
     *
     * @author Jonathan Hedley, jonathan@hedley.net 
     */
    public class Document extends Element {
        private OutputSettings outputSettings = new OutputSettings();
        // ...
    }


    /**
     * A Document's output settings control the form of the text() and html() methods.
     */
    public static class OutputSettings implements Cloneable {
        private Entities.EscapeMode escapeMode = Entities.EscapeMode.base;
        // ...
    }

Workaround (unescape as XML):

With StringEscapeUtils from the Apache Commons Lang project, you can unescape these entities easily. See:

    String unescapedXml = StringEscapeUtils.unescapeXml(l_output);
    System.out.println(unescapedXml);

This will print:

    <html>
     <head></head>
     <body>
      before 
      <a href="http://a.b.com/ct.html?a=111&b=222">link text</a> after
     </body>
    </html>

But of course, it will replace every &amp; in the output, not only the one in the href...
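Since unescaping the whole output touches every entity, a narrower option is to unescape only inside href values. The following is a rough sketch using nothing but java.util.regex. The class `HrefAmpersand` and the method `unescapeInHref` are hypothetical names, and it assumes double-quoted href values like the ones in the output shown earlier. It is a plain string rewrite, not an HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefAmpersand {
    // Matches href="..." with a double-quoted value; group 1 is the value.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    // Replaces &amp; with & inside href values only; other text is untouched.
    static String unescapeInHref(String html) {
        Matcher m = HREF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = m.group(1).replace("&amp;", "&");
            // quoteReplacement keeps '$' and '\' in the URL from being
            // interpreted as regex replacement syntax.
            m.appendReplacement(out, Matcher.quoteReplacement("href=\"" + value + "\""));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://a.b.com/ct.html?a=111&amp;b=222\">R&amp;D</a>";
        System.out.println(unescapeInHref(html));
    }
}
```

In this sketch the `&amp;` in the link text is left alone, while the one inside the href becomes a bare `&`.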

Asked: 12 years, 9 months ago
Viewed: 2180 times
Last active: 12 years, 6 months ago