Convert HTML text to plain text using Objective-C

Igo*_*yuk 25 html objective-c nsstring ios

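I have an NSString that holds a large amount of HTML text. The string is more than 3,500,000 characters long. How can I convert this HTML into plain text stored in an NSString? I am using a scanner, but it is too slow. Any ideas?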

o15*_*1s2 67

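It depends on which iOS version you are targeting. Since iOS 7 there is a built-in method that not only strips the HTML tags but also applies the formatting to the string: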

Xcode 9/Swift 4

if let htmlStringData = htmlString.data(using: .utf8), let attributedString = try? NSAttributedString(data: htmlStringData, options: [.documentType : NSAttributedString.DocumentType.html], documentAttributes: nil) {
    print(attributedString)
}

You can even create an extension like this:

extension String {
    var htmlToAttributedString: NSAttributedString? {
        guard let data = self.data(using: .utf8) else {
            return nil
        }

        do {
            return try NSAttributedString(data: data, options: [.documentType : NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
        } catch {
            print("Cannot convert html string to attributed string: \(error)")
            return nil
        }
    }
}

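Note that this sample code uses UTF-8 encoding. You could also write a function instead of a computed property and take the encoding as a parameter.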

Swift 3

let attributedString = try NSAttributedString(data: htmlString.data(using: .utf8)!,
                                              options: [NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType],
                                              documentAttributes: nil)

Objective-C

[[NSAttributedString alloc] initWithData:[htmlString dataUsingEncoding:NSUTF8StringEncoding] options:@{NSDocumentTypeDocumentAttribute: NSHTMLTextDocumentType, NSCharacterEncodingDocumentAttribute: [NSNumber numberWithInt:NSUTF8StringEncoding]} documentAttributes:nil error:nil];

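If you just need to remove everything between < and > (the dirty way!), which can cause problems if the text itself contains these characters, use this: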

- (NSString *)stringByStrippingHTML {
   // Category method on NSString. Assumes ARC, so no autorelease call is needed.
   NSRange r;
   NSString *s = [self copy];
   // Remove the first remaining <...> match until none is left.
   while ((r = [s rangeOfString:@"<[^>]+>" options:NSRegularExpressionSearch]).location != NSNotFound)
     s = [s stringByReplacingCharactersInRange:r withString:@""];
   return s;
}
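
For a string of several million characters, rescanning from the start for every tag gets expensive. As a rough alternative, here is a minimal sketch of the same "drop everything between < and >" idea done in one linear pass. The function name `stripTags` is hypothetical and not part of any SDK; it does not decode entities such as `&amp;`, and it has the same caveat about literal < and > characters in the text.

```swift
// Hypothetical helper, not from the answer above: removes everything between
// "<" and the next ">" in a single pass over the input.
func stripTags(_ html: String) -> String {
    var result = ""
    result.reserveCapacity(html.utf8.count)
    // Holds the text seen since an unmatched "<"; nil while outside a tag.
    var pendingTag: String? = nil
    for character in html {
        if pendingTag != nil {
            if character == ">" {
                pendingTag = nil                 // tag closed: discard it
            } else {
                pendingTag?.append(character)    // still inside the tag
            }
        } else if character == "<" {
            pendingTag = "<"
        } else {
            result.append(character)
        }
    }
    // A "<" that was never closed is kept as ordinary text.
    if let unfinished = pendingTag {
        result += unfinished
    }
    return result
}
```

For example, `stripTags("<p>Hello <b>world</b></p>")` yields `Hello world`.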

  • +1 for squeezing in a "one-liner" (: (3 upvotes)

Igo*_*yuk 16

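I solved my problem with a scanner, but I do not run it over the whole text at once. I run it on each 10,000-character part of the text and then concatenate all the parts together. My code is below: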

-(NSString *)convertHTML:(NSString *)html {

    NSScanner *myScanner;
    NSString *text = nil;
    myScanner = [NSScanner scannerWithString:html];

    while ([myScanner isAtEnd] == NO) {

        [myScanner scanUpToString:@"<" intoString:NULL] ;

        [myScanner scanUpToString:@">" intoString:&text] ;

        html = [html stringByReplacingOccurrencesOfString:[NSString stringWithFormat:@"%@>", text] withString:@""];
    }
    // Trim leading and trailing whitespace and newlines.
    html = [html stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]];

    return html;
}

Swift 4:

func htmlToString(html: String) -> String {
    var htmlStr = html
    let scanner = Scanner(string: htmlStr)
    var text: NSString? = nil
    while !scanner.isAtEnd {
        // Skip to the next "<", then capture the tag text up to (not including) ">".
        scanner.scanUpTo("<", into: nil)
        if scanner.scanUpTo(">", into: &text), let tag = text {
            // Remove every occurrence of that tag from the working string.
            htmlStr = htmlStr.replacingOccurrences(of: "\(tag)>", with: "")
        }
    }
    return htmlStr.trimmingCharacters(in: .whitespacesAndNewlines)
}
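
The code above does not show the splitting step itself, so here is a rough sketch of how the 10,000-character parts could be produced and rejoined. Everything in it is an assumption on my side: the name `processInChunks`, the `chunkSize` and `transform` parameters, and the rule that a part ending in the middle of a tag is extended to the closing ">" so that no tag is cut in half. Chunking probably helps because each replace call then only has to search a short string instead of the whole document.

```swift
// Hypothetical helper: applies `transform` to consecutive parts of `html`
// and concatenates the results in order.
func processInChunks(_ html: String,
                     chunkSize: Int = 10_000,
                     transform: (String) -> String) -> String {
    precondition(chunkSize > 0, "chunkSize must be positive")
    var output = ""
    var start = html.startIndex
    while start < html.endIndex {
        // Tentative end of this part, clamped to the end of the string.
        var end = html.index(start, offsetBy: chunkSize,
                             limitedBy: html.endIndex) ?? html.endIndex
        let chunk = html[start..<end]
        // Assumed boundary rule: if the part ends inside a tag (a "<" with no
        // ">" after it), extend the part to just past the next ">".
        if let lastOpen = chunk.lastIndex(of: "<"),
           !chunk[lastOpen...].contains(">"),
           let close = html[end...].firstIndex(of: ">") {
            end = html.index(after: close)
        }
        output += transform(String(html[start..<end]))
        start = end
    }
    return output
}
```

Usage would look like `processInChunks(bigHTML) { htmlToString(html: $0) }`, where `bigHTML` is a placeholder for your own string. Note that trimming inside `transform` also removes whitespace at part boundaries, so you may want to trim only once, after the parts are joined.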