How do I decode HTML entities in Swift?

cod*_*ies 113 json html-entities swift

I am pulling a JSON file from a site, and one of the strings it contains is:

The Weeknd &#8216;King Of The Fall&#8217; [Video Premiere] | @TheWeeknd | #SoPhi 

How can I convert things like &#8216; into the correct characters?

I've made an Xcode Playground to demonstrate it:

import UIKit

var error: NSError?
let blogUrl: NSURL = NSURL.URLWithString("http://sophisticatedignorance.net/api/get_recent_summary/")
let jsonData = NSData(contentsOfURL: blogUrl)

let dataDictionary = NSJSONSerialization.JSONObjectWithData(jsonData, options: nil, error: &error) as NSDictionary

var a = dataDictionary["posts"] as NSArray

println(a[0]["title"])
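For readers on current Swift, the playground above could be approximated roughly as follows. This is only a sketch: the `Post` and `RecentSummary` types are hypothetical names, the JSON shape (a `posts` array whose items have a `title`) is inferred from the code above, and the sample payload is made up so that no network call is needed.

```swift
import Foundation

// Hypothetical model types; only `posts` and `title` come from the question's code.
struct Post: Decodable {
    let title: String
}

struct RecentSummary: Decodable {
    let posts: [Post]
}

// Made-up sample payload standing in for the real API response.
let sampleJSON = """
{"posts": [{"title": "The Weeknd &#8216;King Of The Fall&#8217; [Video Premiere]"}]}
"""

// Returns the title of the first post, or nil if the JSON cannot be decoded.
func firstTitle(in json: String) -> String? {
    let data = Data(json.utf8)
    let summary = try? JSONDecoder().decode(RecentSummary.self, from: data)
    return summary?.posts.first?.title
}

print(firstTitle(in: sampleJSON) ?? "no title")
```

JSON decoding does not touch HTML entities, so the title still contains `&#8216;` and `&#8217;` at this point, which is exactly the problem being asked about.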

aka*_*kyy 147

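There is no direct way to do this, but you can use NSAttributedString magic to make the process as painless as possible (be warned that this method also strips all HTML tags):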

let encodedString = "The Weeknd <em>&#8216;King Of The Fall&#8217;</em>"

// encodedString should = a[0]["title"] in your case
// (the `return nil` lines below assume this snippet sits inside a function returning an optional)

guard let data = encodedString.data(using: .utf8) else {
    return nil
}

let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
    .documentType: NSAttributedString.DocumentType.html,
    .characterEncoding: String.Encoding.utf8.rawValue
]

guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
    return nil
}

let decodedString = attributedString.string // The Weeknd ‘King Of The Fall’

Remember to initialize NSAttributedString from the main thread only. It uses some WebKit magic underneath, hence the requirement.
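One way to respect the main-thread requirement is to funnel the work through a small helper. The sketch below is an assumption on my part, not part of the original answer: `onMainThread` is a hypothetical name, and the closure here just counts characters as a stand-in for the NSAttributedString call.

```swift
import Foundation

/// Hypothetical helper: runs `work` on the main thread and returns its result.
func onMainThread<T>(_ work: () -> T) -> T {
    if Thread.isMainThread {
        // Already on the main thread: calling `sync` here would deadlock.
        return work()
    }
    return DispatchQueue.main.sync(execute: work)
}

// Stand-in workload; in practice the closure would wrap the HTML decoding.
let length = onMainThread { "King Of The Fall".count }
print(length)
```

Keep in mind that `sync` blocks the calling background thread until the main thread gets around to the work, so this trades a crash for a potential stall.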


You can create your own String extension to increase reusability:

extension String {

    init?(htmlEncodedString: String) {

        guard let data = htmlEncodedString.data(using: .utf8) else {
            return nil
        }

        let options: [NSAttributedString.DocumentReadingOptionKey: Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]

        guard let attributedString = try? NSAttributedString(data: data, options: options, documentAttributes: nil) else {
            return nil
        }

        self.init(attributedString.string)
    }

}


let encodedString = "The Weeknd <em>&#8216;King Of The Fall&#8217;</em>"
let decodedString = String(htmlEncodedString: encodedString)

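  • What? Extensions are *meant* to extend existing types to provide new functionality. (54 upvotes)
  • This method is very heavy and is not recommended in table views or grid views. (12 upvotes)
  • I understand what you're trying to say, but rejecting extensions isn't the way to go. (4 upvotes)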
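On the cost raised in the comments: one possible mitigation, sketched here under assumptions, is to decode each distinct string once and reuse the result when cells are redrawn. `DecodedStringCache` is a hypothetical type, the decoder is injected as a closure, and the demo uses a stand-in decoder that only counts its calls. The cache is not thread-safe.

```swift
/// Hypothetical cache: decodes each distinct string once and reuses the result.
final class DecodedStringCache {
    private var storage: [String: String] = [:]
    private let decode: (String) -> String

    init(decode: @escaping (String) -> String) {
        self.decode = decode
    }

    func decoded(_ encoded: String) -> String {
        if let hit = storage[encoded] { return hit }
        let value = decode(encoded)
        storage[encoded] = value
        return value
    }
}

/// Demo: asks for the same key twice and reports how often the decoder ran.
func countDecoderCalls() -> Int {
    var calls = 0
    // Stand-in decoder; a real one would call an HTML entity decoder.
    let cache = DecodedStringCache { input in
        calls += 1
        return input.uppercased()
    }
    _ = cache.decoded("title")
    _ = cache.decoded("title")
    return calls
}

print(countDecoderCalls())
```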

Mar*_*n R 77

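@akashivskyy's answer is great and demonstrates how to use NSAttributedString to decode HTML entities. One possible disadvantage (as he stated) is that all HTML markup is removed as well, so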

<strong> 4 &lt; 5 &amp; 3 &gt; 2</strong>

becomes

4 < 5 & 3 > 2

On OS X there is CFXMLCreateStringByUnescapingEntities(), which does the job:

let encoded = "<strong> 4 &lt; 5 &amp; 3 &gt; 2 .</strong> Price: 12 &#x20ac;.  &#64; "
let decoded = CFXMLCreateStringByUnescapingEntities(nil, encoded as CFString, nil) as String
print(decoded)
// <strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €.  @ 

but this is not available on iOS.

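Here is a pure Swift implementation. It decodes character entity references like &lt; using a dictionary, and all numeric character entities like &#64; or &#x20ac;. (Note that I did not list all 252 HTML entities explicitly.)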

Swift 4:

// Mapping from XML/HTML character entity reference to character
// From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
private let characterEntities : [ Substring : Character ] = [
    // XML predefined entities:
    "&quot;"    : "\"",
    "&amp;"     : "&",
    "&apos;"    : "'",
    "&lt;"      : "<",
    "&gt;"      : ">",

    // HTML character entity references:
    "&nbsp;"    : "\u{00a0}",
    // ...
    "&diams;"   : "♦",
]

extension String {

    /// Returns a new string made by replacing in the `String`
    /// all HTML character entity references with the corresponding
    /// character.
    var stringByDecodingHTMLEntities : String {

        // ===== Utility functions =====

        // Convert the number in the string to the corresponding
        // Unicode character, e.g.
        //    decodeNumeric("64", 10)   --> "@"
        //    decodeNumeric("20ac", 16) --> "€"
        func decodeNumeric(_ string : Substring, base : Int) -> Character? {
            guard let code = UInt32(string, radix: base),
                let uniScalar = UnicodeScalar(code) else { return nil }
            return Character(uniScalar)
        }

        // Decode the HTML character entity to the corresponding
        // Unicode character, return `nil` for invalid input.
        //     decode("&#64;")    --> "@"
        //     decode("&#x20ac;") --> "€"
        //     decode("&lt;")     --> "<"
        //     decode("&foo;")    --> nil
        func decode(_ entity : Substring) -> Character? {

            if entity.hasPrefix("&#x") || entity.hasPrefix("&#X") {
                return decodeNumeric(entity.dropFirst(3).dropLast(), base: 16)
            } else if entity.hasPrefix("&#") {
                return decodeNumeric(entity.dropFirst(2).dropLast(), base: 10)
            } else {
                return characterEntities[entity]
            }
        }

        // ===== Method starts here =====

        var result = ""
        var position = startIndex

        // Find the next '&' and copy the characters preceding it to `result`:
        while let ampRange = self[position...].range(of: "&") {
            result.append(contentsOf: self[position ..< ampRange.lowerBound])
            position = ampRange.lowerBound

            // Find the next ';' and copy everything from '&' to ';' into `entity`
            guard let semiRange = self[position...].range(of: ";") else {
                // No matching ';'.
                break
            }
            let entity = self[position ..< semiRange.upperBound]
            position = semiRange.upperBound

            if let decoded = decode(entity) {
                // Replace by decoded character:
                result.append(decoded)
            } else {
                // Invalid entity, copy verbatim:
                result.append(contentsOf: entity)
            }
        }
        // Copy remaining characters to `result`:
        result.append(contentsOf: self[position...])
        return result
    }
}

Example:

let encoded = "<strong> 4 &lt; 5 &amp; 3 &gt; 2 .</strong> Price: 12 &#x20ac;.  &#64; "
let decoded = encoded.stringByDecodingHTMLEntities
print(decoded)
// <strong> 4 < 5 & 3 > 2 .</strong> Price: 12 €.  @

Swift 3:

// Mapping from XML/HTML character entity reference to character
// From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
private let characterEntities : [ String : Character ] = [
    // XML predefined entities:
    "&quot;"    : "\"",
    "&amp;"     : "&",
    "&apos;"    : "'",
    "&lt;"      : "<",
    "&gt;"      : ">",

    // HTML character entity references:
    "&nbsp;"    : "\u{00a0}",
    // ...
    "&diams;"   : "♦",
]

extension String {

    /// Returns a new string made by replacing in the `String`
    /// all HTML character entity references with the corresponding
    /// character.
    var stringByDecodingHTMLEntities : String {

        // ===== Utility functions =====

        // Convert the number in the string to the corresponding
        // Unicode character, e.g.
        //    decodeNumeric("64", 10)   --> "@"
        //    decodeNumeric("20ac", 16) --> "€"
        func decodeNumeric(_ string : String, base : Int) -> Character? {
            guard let code = UInt32(string, radix: base),
                let uniScalar = UnicodeScalar(code) else { return nil }
            return Character(uniScalar)
        }

        // Decode the HTML character entity to the corresponding
        // Unicode character, return `nil` for invalid input.
        //     decode("&#64;")    --> "@"
        //     decode("&#x20ac;") --> "€"
        //     decode("&lt;")     --> "<"
        //     decode("&foo;")    --> nil
        func decode(_ entity : String) -> Character? {

            if entity.hasPrefix("&#x") || entity.hasPrefix("&#X"){
                return decodeNumeric(entity.substring(with: entity.index(entity.startIndex, offsetBy: 3) ..< entity.index(entity.endIndex, offsetBy: -1)), base: 16)
            } else if entity.hasPrefix("&#") {
                return decodeNumeric(entity.substring(with: entity.index(entity.startIndex, offsetBy: 2) ..< entity.index(entity.endIndex, offsetBy: -1)), base: 10)
            } else {
                return characterEntities[entity]
            }
        }

        // ===== Method starts here =====

        var result = ""
        var position = startIndex

        // Find the next '&' and copy the characters preceding it to `result`:
        while let ampRange = self.range(of: "&", range: position ..< endIndex) {
            result.append(self[position ..< ampRange.lowerBound])
            position = ampRange.lowerBound

            // Find the next ';' and copy everything from '&' to ';' into `entity`
            if let semiRange = self.range(of: ";", range: position ..< endIndex) {
                let entity = self[position ..< semiRange.upperBound]
                position = semiRange.upperBound

                if let decoded = decode(entity) {
                    // Replace by decoded character:
                    result.append(decoded)
                } else {
                    // Invalid entity, copy verbatim:
                    result.append(entity)
                }
            } else {
                // No matching ';'.
                break
            }
        }
        // Copy remaining characters to `result`:
        result.append(self[position ..< endIndex])
        return result
    }
}

Swift 2:

// Mapping from XML/HTML character entity reference to character
// From http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references
private let characterEntities : [ String : Character ] = [
    // XML predefined entities:
    "&quot;"    : "\"",
    "&amp;"     : "&",
    "&apos;"    : "'",
    "&lt;"      : "<",
    "&gt;"      : ">",

    // HTML character entity references:
    "&nbsp;"    : "\u{00a0}",
    // ...
    "&diams;"   : "♦",
]

extension String {

    /// Returns a new string made by replacing in the `String`
    /// all HTML character entity references with the corresponding
    /// character.
    var stringByDecodingHTMLEntities : String {

        // ===== Utility functions =====

        // Convert the number in the string to the corresponding
        // Unicode character, e.g.
        //    decodeNumeric("64", 10)   --> "@"
        //    decodeNumeric("20ac", 16) --> "€"
        func decodeNumeric(string : String, base : Int32) -> Character? {
            let code = UInt32(strtoul(string, nil, base))
            return Character(UnicodeScalar(code))
        }

        // Decode the HTML character entity to the corresponding
        // Unicode character, return `nil` for invalid input.
        //     decode("&#64;")    --> "@"
        //     decode("&#x20ac;") --> "€"
        //     decode("&lt;")     --> "<"
        //     decode("&foo;")    --> nil
        func decode(entity : String) -> Character? {

            if entity.hasPrefix("&#x") || entity.hasPrefix("&#X"){
                return decodeNumeric(entity.substringFromIndex(entity.startIndex.advancedBy(3)), base: 16)
            } else if entity.hasPrefix("&#") {
                return decodeNumeric(entity.substringFromIndex(entity.startIndex.advancedBy(2)), base: 10)
            } else {
                return characterEntities[entity]
            }
        }

        // ===== Method starts here =====

        var result = ""
        var position = startIndex

        // Find the next '&' and copy the characters preceding it to `result`:
        while let ampRange = self.rangeOfString("&", range: position ..< endIndex) {
            result.appendContentsOf(self[position ..< ampRange.startIndex])
            position = ampRange.startIndex

            // Find the next ';' and copy everything from '&' to ';' into `entity`
            if let semiRange = self.rangeOfString(";", range: position ..< endIndex) {
                let entity = self[position ..< semiRange.endIndex]
                position = semiRange.endIndex

                if let decoded = decode(entity) {
                    // Replace by decoded character:
                    result.append(decoded)
                } else {
                    // Invalid entity, copy verbatim:
                    result.appendContentsOf(entity)
                }
            } else {
                // No matching ';'.
                break
            }
        }
        // Copy remaining characters to `result`:
        result.appendContentsOf(self[position ..< endIndex])
        return result
    }
}

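  • This is great, thanks Martin! Here's the extension with the full list of HTML entities: https://gist.github.com/mwaterfall/25b4a6a06dc3309d9555 I've also slightly adapted it to report the distance offsets produced by the replacements. This allows correct adjustment of any string attributes or entities that might be affected by these replacements (Twitter entity indices, for example). (10 upvotes)
  • @MichaelWaterfall and Martin, this is magnificent! Works like a charm! I updated the extension for Swift 2: http://pastebin.com/juHRJ6au Thanks! (3 upvotes)
  • https://gist.github.com/x0rb0t/a6c190dbefdfedad71143ff7f8153588 The full list comes from https://dev.w3.org/html5/html-author/charref (2 upvotes)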

小智 27

Swift 3 version of @akashivskyy's extension:

extension String {
    init(htmlEncodedString: String) {
        self.init()
        guard let encodedData = htmlEncodedString.data(using: .utf8) else {
            self = htmlEncodedString
            return
        }

        let attributedOptions: [String : Any] = [
            NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
            NSCharacterEncodingDocumentAttribute: String.Encoding.utf8.rawValue
        ]

        do {
            let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
            self = attributedString.string
        } catch {
            print("Error: \(error)")
            self = htmlEncodedString
        }
    }
}


Aam*_*irR 16

Swift 4


  • String extension with a computed var
  • No extra guard/do/catch, etc.
  • Returns the original string if decoding fails

extension String {
    var htmlDecoded: String {
        let decoded = try? NSAttributedString(data: Data(utf8), options: [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ], documentAttributes: nil).string

        return decoded ?? self
    }
}

  • I love the simplicity of this answer. However, it will crash when run in the background, because it tries to run on the main thread. (2 upvotes)

Zai*_*han 14

Swift 2 version of @akashivskyy's extension:

 extension String {
     init(htmlEncodedString: String) {
         if let encodedData = htmlEncodedString.dataUsingEncoding(NSUTF8StringEncoding){
             let attributedOptions : [String: AnyObject] = [
                 NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
                 NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding
             ]

             do{
                 if let attributedString:NSAttributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil){
                     self.init(attributedString.string)
                 }else{
                     print("error")
                     self.init(htmlEncodedString)     //Returning actual string if there is an error
                 }
             }catch{
                 print("error: \(error)")
                 self.init(htmlEncodedString)     //Returning actual string if there is an error
             }

         }else{
             self.init(htmlEncodedString)     //Returning actual string if there is an error
         }
     }
 }


pip*_*bar 9

Swift 4 version

extension String {

    init(htmlEncodedString: String) {
        self.init()
        guard let encodedData = htmlEncodedString.data(using: .utf8) else {
            self = htmlEncodedString
            return
        }

        let attributedOptions: [NSAttributedString.DocumentReadingOptionKey : Any] = [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
        ]

        do {
            let attributedString = try NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil)
            self = attributedString.string
        } catch {
            print("Error: \(error)")
            self = htmlEncodedString
        }
    }
}

  • Please, the `rawValue` syntax `NSAttributedString.DocumentReadingOptionKey(rawValue:NSAttributedString.DocumentAttributeKey.documentType.rawValue)` and `NSAttributedString.DocumentReadingOptionKey(rawValue:NSAttributedString.DocumentAttributeKey.characterEncoding.rawValue)` is horrible. Replace it with `.documentType` and `.characterEncoding`. (8 upvotes)

You*_*Lin 8

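I was looking for a pure Swift 3.0 utility to escape and unescape HTML character references (i.e. for server-side Swift apps on both macOS and Linux) but didn't find any comprehensive solution, so I wrote my own implementation: https://github.com/IBM-Swift/swift-html-entities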

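HTMLEntities works with HTML4 named character references as well as hex/decimal numeric character references, and it recognizes special numeric character references per the W3 HTML5 spec (i.e. &#x80; should be unescaped as the Euro sign (unicode U+20AC) and NOT as the unicode character U+0080, and certain ranges of numeric character references should be replaced with the replacement character U+FFFD when unescaping).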
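To make those numeric-reference rules concrete, here is a rough sketch of the idea. It is not the library's code: `character(forNumericReference:)` and `c1Replacements` are hypothetical names, and the table is only an excerpt of the remapping rows I am sure of (the spec lists more).

```swift
// Excerpt of the HTML5 remapping for C1 code points; the full spec table has more rows.
let c1Replacements: [UInt32: UInt32] = [
    0x80: 0x20AC, // euro sign
    0x85: 0x2026, // horizontal ellipsis
    0x91: 0x2018, // left single quotation mark
    0x92: 0x2019, // right single quotation mark
    0x93: 0x201C, // left double quotation mark
    0x94: 0x201D, // right double quotation mark
    0x99: 0x2122, // trade mark sign
]

/// Hypothetical helper: maps the number of a numeric character reference to a character.
func character(forNumericReference code: UInt32) -> Character {
    let replacement: Character = "\u{FFFD}"
    if let mapped = c1Replacements[code], let scalar = Unicode.Scalar(mapped) {
        return Character(scalar)
    }
    // Zero, surrogates and values above U+10FFFF become U+FFFD.
    guard code != 0, let scalar = Unicode.Scalar(code) else {
        return replacement
    }
    return Character(scalar)
}

print(character(forNumericReference: 0x80))   // €
print(character(forNumericReference: 0x40))   // @
```

The failable `Unicode.Scalar(UInt32)` initializer already rejects surrogates and out-of-range values, which is why a single `guard` covers those cases.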

Usage example:

import HTMLEntities

// encode example
let html = "<script>alert(\"abc\")</script>"

print(html.htmlEscape())
// Prints "&lt;script&gt;alert(&quot;abc&quot;)&lt;/script&gt;"

// decode example
let htmlencoded = "&lt;script&gt;alert(&quot;abc&quot;)&lt;/script&gt;"

print(htmlencoded.htmlUnescape())
// Prints "<script>alert(\"abc\")</script>"

For the OP's example:

print("The Weeknd &#8216;King Of The Fall&#8217; [Video Premiere] | @TheWeeknd | #SoPhi ".htmlUnescape())
// prints "The Weeknd ‘King Of The Fall’ [Video Premiere] | @TheWeeknd | #SoPhi "

Edit: HTMLEntities supports HTML5 named character references as of version 2.0.0. Spec-compliant parsing is also implemented.

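  • This is the most generic answer, it works all the time, and it does not need to run on the main thread. It works even with the most complex HTML-escaped unicode strings (such as `( ͡° ͜ʖ ͡° )`), whereas none of the other answers manage that. (3 upvotes)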

wLc*_*wLc 7

extension String{
    func decodeEnt() -> String{
        let encodedData = self.dataUsingEncoding(NSUTF8StringEncoding)!
        let attributedOptions : [String: AnyObject] = [
            NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
            NSCharacterEncodingDocumentAttribute: NSUTF8StringEncoding
        ]
        let attributedString = NSAttributedString(data: encodedData, options: attributedOptions, documentAttributes: nil, error: nil)!

        return attributedString.string
    }
}

let encodedString = "The Weeknd &#8216;King Of The Fall&#8217;"

let foo = encodedString.decodeEnt() // The Weeknd ‘King Of The Fall’

  • "The Weeknd" is a singer, and yes, that is how his name is spelled. (3 upvotes)

Nai*_*hta 6

Swift 4:

The overall solution that finally worked for me with HTML code, newline characters and single quotes:

extension String {
    var htmlDecoded: String {
        let decoded = try? NSAttributedString(data: Data(utf8), options: [
            .documentType: NSAttributedString.DocumentType.html,
            .characterEncoding: String.Encoding.utf8.rawValue
            ], documentAttributes: nil).string

        return decoded ?? self
    }
}
Run Code Online (Sandbox Code Playgroud)

Usage:

let yourStringEncoded = yourStringWithHtmlcode.htmlDecoded

Then I had to apply a few more filters to get rid of single quotes (for example, don't, hasn't, it's, etc.) and newline characters like \n:

var yourNewString = String(yourStringEncoded.filter { !"\n\t\r".contains($0) })
yourNewString = yourNewString.replacingOccurrences(of: "\'", with: "", options: NSString.CompareOptions.literal, range: nil)


Bse*_*orn 5

This would be my approach. You could add the entities dictionary from https://gist.github.com/mwaterfall/25b4a6a06dc3309d9555 that Michael Waterfall mentions.

extension String {
    func htmlDecoded()->String {

        guard (self != "") else { return self }

        var newStr = self

        let entities = [
            "&quot;"    : "\"",
            "&amp;"     : "&",
            "&apos;"    : "'",
            "&lt;"      : "<",
            "&gt;"      : ">",
        ]

        for (name,value) in entities {
            newStr = newStr.stringByReplacingOccurrencesOfString(name, withString: value)
        }
        return newStr
    }
}

Example of use:

let encoded = "this is so &quot;good&quot;"
let decoded = encoded.htmlDecoded() // "this is so "good""

Or

let encoded = "this is so &quot;good&quot;".htmlDecoded() // "this is so "good""
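One caveat with dictionary-based replacement: Swift dictionaries are iterated in an unspecified order, so `&amp;` may be replaced before the other entities, and a doubly escaped input such as `&amp;lt;` can then end up as `<` instead of `&lt;`. A possible variation, written for current Swift and offered only as a sketch (`decodeBasicEntities` is a hypothetical name), uses an ordered list and handles `&amp;` last:

```swift
import Foundation

/// Hypothetical variant: replaces a fixed, ordered list of entities, `&amp;` last.
func decodeBasicEntities(_ input: String) -> String {
    let orderedEntities: [(name: String, value: String)] = [
        ("&quot;", "\""),
        ("&apos;", "'"),
        ("&lt;", "<"),
        ("&gt;", ">"),
        ("&amp;", "&"), // last, so its output is never decoded a second time
    ]
    var result = input
    for entity in orderedEntities {
        result = result.replacingOccurrences(of: entity.name, with: entity.value)
    }
    return result
}

print(decodeBasicEntities("this is so &quot;good&quot;"))
print(decodeBasicEntities("&amp;lt;"))
```

This still only covers the five predefined XML entities; numeric references and the remaining named entities need one of the fuller approaches above.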