HTML character decoding in Objective-C / Cocoa Touch

tre*_*nik 102 html iphone cocoa cocoa-touch objective-c

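First of all, I found this: Objective C HTML escape/unescape, but it doesn't work for me.

My encoded characters (they come from an RSS feed, by the way) look like this: &#038;

I searched all over the net and found related discussions, but no fix for my particular encoding. I think they are called hexadecimal characters.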

Mic*_*all 163

Check out my NSString category for HTML. Here are the methods available:

- (NSString *)stringByConvertingHTMLToPlainText;
- (NSString *)stringByDecodingHTMLEntities;
- (NSString *)stringByEncodingHTMLEntities;
- (NSString *)stringWithNewLinesAsBRs;
- (NSString *)stringByRemovingNewLinesAndWhitespace;

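  • An update of the code for ARC would be handy. Xcode throws a large number of ARC errors and warnings on build. (10 upvotes)
  • After hours of searching, I know this is the only way of doing it that really works. NSString is overdue for a string method that can do this. Well done. (4 upvotes)
  • Dude, excellent functions. Your stringByDecodingXMLEntities method made my day! Thanks! (3 upvotes)
  • No problem ;) Glad you found it useful! (3 upvotes)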

Wal*_*ung 52

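Daniel's version is basically very good; I fixed a few issues in it:

  1. Removed the NSScanner's characters to be skipped (otherwise whitespace between two consecutive entities is ignored):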

    [scanner setCharactersToBeSkipped:nil];

  2. Fixed the parsing when there is an isolated '&' symbol (I'm not sure what the 'correct' output is here; I just compared it against Firefox):

For example:

    &#ABC DF & B'  & C' Items (288)
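The effect of the first fix can be seen in isolation with a small sketch. `DecodeAmpOnly` is a hypothetical helper written only for this illustration (it decodes nothing but `&amp;`); the point is that an NSScanner skips whitespace and newlines by default before each scan, so a space sitting between two entities is dropped unless skipping is turned off.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper for illustration only: decodes just "&amp;".
// When disableSkipping is NO the scanner keeps its default behaviour of
// skipping whitespace and newlines before every scan operation.
static NSString *DecodeAmpOnly(NSString *input, BOOL disableSkipping) {
    NSScanner *scanner = [NSScanner scannerWithString:input];
    if (disableSkipping) {
        [scanner setCharactersToBeSkipped:nil];
    }

    NSMutableString *result = [NSMutableString string];
    while (![scanner isAtEnd]) {
        // Copy everything up to the next ampersand.
        NSString *text = nil;
        if ([scanner scanUpToString:@"&" intoString:&text]) {
            [result appendString:text];
        }
        if ([scanner scanString:@"&amp;" intoString:NULL]) {
            [result appendString:@"&"];
        } else if ([scanner scanString:@"&" intoString:NULL]) {
            // Isolated ampersand: keep it as it is.
            [result appendString:@"&"];
        }
    }
    return result;
}
```

Calling `DecodeAmpOnly(@"&amp; &amp;", NO)` loses the space between the two entities, while passing `YES` preserves it.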

Here is the modified code:

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];

    [scanner setCharactersToBeSkipped:nil];

    NSCharacterSet *boundaryCharacterSet = [NSCharacterSet characterSetWithCharactersInString:@" \t\n\r;"];

    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }

            if (gotNumber) {
                [result appendFormat:@"%C", (unichar)charCode];

                [scanner scanString:@";" intoString:NULL];
            }
            else {
                NSString *unknownEntity = @"";

                [scanner scanUpToCharactersFromSet:boundaryCharacterSet intoString:&unknownEntity];


                [result appendFormat:@"&#%@%@", xForHex, unknownEntity];

                //[scanner scanUpToString:@";" intoString:&unknownEntity];
                //[result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);

            }

        }
        else {
            NSString *amp;

            [scanner scanString:@"&" intoString:&amp];  //an isolated & symbol
            [result appendString:amp];

            /*
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
             */
        }

    }
    while (![scanner isAtEnd]);

finish:
    return result;
}


Mat*_*ges 46

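Those are called character entity references. When they take the form &#<number>; they are called numeric entity references. Basically, it is a string representation of the character code that should be substituted. In the case of &#038;, it represents the character with the value 38 (a Unicode code point; values below 256 coincide with the ISO-8859-1 encoding), which is &.

The reason the ampersand has to be encoded in RSS is that it is a reserved special character.

What you need to do is parse the string and replace each entity with the character matching the value between &# and ;. I don't know of any great ways to do this in Objective-C, but this Stack Overflow question might be of some help.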
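As a minimal sketch of that parsing step, assuming a single decimal reference and nothing else in the string: `DecodeDecimalReference` is a hypothetical helper, not an existing API, and it only handles values that fit in one UTF-16 unit.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper for illustration: decodes one decimal reference such
// as "&#038;" and returns nil when the input does not have that exact shape.
static NSString *DecodeDecimalReference(NSString *reference) {
    NSScanner *scanner = [NSScanner scannerWithString:reference];
    [scanner setCharactersToBeSkipped:nil];

    int value = 0;
    if (![scanner scanString:@"&#" intoString:NULL] ||
        ![scanner scanInt:&value] ||          // reads "038" as decimal 38
        ![scanner scanString:@";" intoString:NULL] ||
        ![scanner isAtEnd]) {
        return nil;
    }
    // Assumption of this sketch: only values that fit in a single UTF-16
    // unit are supported; anything else is rejected.
    if (value <= 0 || value > 0xFFFF) {
        return nil;
    }
    unichar character = (unichar)value;
    return [NSString stringWithCharacters:&character length:1];
}
```

With this helper, `DecodeDecimalReference(@"&#038;")` yields the one-character string `&`, and a named entity such as `&amp;` is rejected with nil.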

Edit: Since I answered this some two years ago, some great solutions have appeared; see @Michael Waterfall's answer.

  • +1 I was just about to submit the exact same answer (including the same links, no less!) (2 upvotes)

Bry*_*uby 45

As of iOS 7, you can decode HTML characters natively by using an NSAttributedString with the NSHTMLTextDocumentType attribute:

NSString *htmlString = @"&#63743; &amp; &#38; &lt; &gt; &trade; &copy; &hearts; &clubs; &spades; &diams;";
NSData *stringData = [htmlString dataUsingEncoding:NSUTF8StringEncoding];

NSDictionary *options = @{NSDocumentTypeDocumentAttribute:NSHTMLTextDocumentType};
NSAttributedString *decodedString;
decodedString = [[NSAttributedString alloc] initWithData:stringData
                                                 options:options
                                      documentAttributes:NULL
                                                   error:NULL];

The decoded attributed string will now be displayed as: && <>™©♥♣♠♦.

Note: This only works when called on the main thread.

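  • The best answer if you don't need to support iOS 6 and earlier. (6 upvotes)
  • This works for decoding entities, but it also messes up non-encoded dashes. (4 upvotes)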
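A hedged sketch of how the mangled dashes might be avoided: the likely cause is that the importer is not told which encoding the data uses, so passing NSCharacterEncodingDocumentAttribute alongside the document type is worth trying. `PlainTextFromHTML` is a hypothetical wrapper name, and the sketch also shows pulling the plain NSString out of the attributed string.

```objc
#import <Foundation/Foundation.h>
#if TARGET_OS_IPHONE
#import <UIKit/UIKit.h>
#else
#import <AppKit/AppKit.h>
#endif

// Hypothetical wrapper for illustration. As noted in the answer above, the
// HTML importer must be used from the main thread.
static NSString *PlainTextFromHTML(NSString *html) {
    NSCAssert([NSThread isMainThread], @"call this on the main thread");

    NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];
    NSDictionary *options = @{
        NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType,
        // Assumption: stating the encoding explicitly keeps non-ASCII
        // characters (such as literal dashes) from being misread.
        NSCharacterEncodingDocumentAttribute: @(NSUTF8StringEncoding)
    };

    NSError *error = nil;
    NSAttributedString *decoded =
        [[NSAttributedString alloc] initWithData:data
                                         options:options
                              documentAttributes:NULL
                                           error:&error];
    if (decoded == nil) {
        NSLog(@"HTML decoding failed: %@", error);
        return nil;
    }
    // Drop the attributes and keep only the decoded characters.
    return [decoded string];
}
```

Because the importer treats its input as a full HTML document, any tags in the string are interpreted as markup as well, which may or may not be what you want for feed text.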

Nik*_*bak 35

Nobody seems to have mentioned one of the simplest options: Google Toolbox for Mac
(Despite the name, it works on iOS too.)

https://github.com/google/google-toolbox-for-mac/blob/master/Foundation/GTMNSString%2BHTML.h

/// Get a string where internal characters that are escaped for HTML are unescaped 
//
///  For example, '&amp;' becomes '&'
///  Handles &#32; and &#x32; cases as well
///
//  Returns:
//    Autoreleased NSString
//
- (NSString *)gtm_stringByUnescapingFromHTML;

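I only had to include three files in the project: the header, the implementation, and GTMDefines.h.

  • I chose to include just those three files, so I needed to do this to make it compatible with ARC: http://code.google.com/p/google-toolbox-for-mac/wiki/ARC_Compatibility (2 upvotes)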

Dan*_*son 17

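I ought to post this on GitHub. This is an NSString category that uses NSScanner for its implementation and handles both hexadecimal and decimal numeric character entities as well as the usual symbolic ones.

It also handles malformed strings (when you have an & followed by an invalid sequence of characters) relatively gracefully, which turned out to be crucial in my released app that uses this code.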

- (NSString *)stringByDecodingXMLEntities {
    NSUInteger myLength = [self length];
    NSUInteger ampIndex = [self rangeOfString:@"&" options:NSLiteralSearch].location;

    // Short-circuit if there are no ampersands.
    if (ampIndex == NSNotFound) {
        return self;
    }
    // Make result string with some extra capacity.
    NSMutableString *result = [NSMutableString stringWithCapacity:(myLength * 1.25)];

    // First iteration doesn't need to scan to & since we did that already, but for code simplicity's sake we'll do it again with the scanner.
    NSScanner *scanner = [NSScanner scannerWithString:self];
    do {
        // Scan up to the next entity or the end of the string.
        NSString *nonEntityString;
        if ([scanner scanUpToString:@"&" intoString:&nonEntityString]) {
            [result appendString:nonEntityString];
        }
        if ([scanner isAtEnd]) {
            goto finish;
        }
        // Scan either a HTML or numeric character entity reference.
        if ([scanner scanString:@"&amp;" intoString:NULL])
            [result appendString:@"&"];
        else if ([scanner scanString:@"&apos;" intoString:NULL])
            [result appendString:@"'"];
        else if ([scanner scanString:@"&quot;" intoString:NULL])
            [result appendString:@"\""];
        else if ([scanner scanString:@"&lt;" intoString:NULL])
            [result appendString:@"<"];
        else if ([scanner scanString:@"&gt;" intoString:NULL])
            [result appendString:@">"];
        else if ([scanner scanString:@"&#" intoString:NULL]) {
            BOOL gotNumber;
            unsigned charCode;
            NSString *xForHex = @"";

            // Is it hex or decimal?
            if ([scanner scanString:@"x" intoString:&xForHex]) {
                gotNumber = [scanner scanHexInt:&charCode];
            }
            else {
                gotNumber = [scanner scanInt:(int*)&charCode];
            }
            if (gotNumber) {
                // %C expects a unichar, so narrow the scanned value explicitly.
                [result appendFormat:@"%C", (unichar)charCode];
            }
            else {
                NSString *unknownEntity = @"";
                [scanner scanUpToString:@";" intoString:&unknownEntity];
                [result appendFormat:@"&#%@%@;", xForHex, unknownEntity];
                NSLog(@"Expected numeric character entity but got &#%@%@;", xForHex, unknownEntity);
            }
            [scanner scanString:@";" intoString:NULL];
        }
        else {
            NSString *unknownEntity = @"";
            [scanner scanUpToString:@";" intoString:&unknownEntity];
            NSString *semicolon = @"";
            [scanner scanString:@";" intoString:&semicolon];
            [result appendFormat:@"%@%@", unknownEntity, semicolon];
            NSLog(@"Unsupported XML character entity %@%@", unknownEntity, semicolon);
        }
    }
    while (![scanner isAtEnd]);

finish:
    return result;
}
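One limitation worth noting in the code above: `%C` formats a single UTF-16 unit, so a numeric reference above U+FFFF (for example `&#x1F600;`) cannot be represented that way. A minimal sketch of one way to handle it follows; `AppendCodePoint` is a hypothetical helper and not part of either answer.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the answers above): appends one Unicode
// code point to `result`, using a surrogate pair when the value does not
// fit in a single UTF-16 unit.
static void AppendCodePoint(NSMutableString *result, unsigned codePoint) {
    BOOL isSurrogate = (codePoint >= 0xD800 && codePoint <= 0xDFFF);
    if (codePoint > 0x10FFFF || isSurrogate) {
        // Not a valid scalar value: substitute U+FFFD REPLACEMENT CHARACTER.
        codePoint = 0xFFFD;
    }

    if (codePoint <= 0xFFFF) {
        unichar unit = (unichar)codePoint;
        [result appendString:[NSString stringWithCharacters:&unit length:1]];
    } else {
        // Split the 20-bit offset into a high and a low surrogate.
        unsigned offset = codePoint - 0x10000;
        unichar pair[2] = {
            (unichar)(0xD800 + (offset >> 10)),   // high surrogate
            (unichar)(0xDC00 + (offset & 0x3FF))  // low surrogate
        };
        [result appendString:[NSString stringWithCharacters:pair length:2]];
    }
}
```

In the decoder, the `appendFormat:@"%C"` line could call `AppendCodePoint(result, charCode)` instead.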