Simplest way to extract the first Unicode code point of an NSString (outside the BMP)?

Quu*_*one 4 cocoa nsstring surrogate-pairs

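For historical reasons, Cocoa's Unicode implementation is 16-bit: it handles Unicode characters above 0xFFFF via "surrogate pairs". This means that the following code does not work correctly: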

NSString *myString = @"";
uint32_t codepoint = [myString characterAtIndex:0];  // returns only the first UTF-16 unit
printf("%04x\n", codepoint);  // incorrectly prints "d842"
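
For reference, the arithmetic behind a surrogate pair can be sketched in plain C, which also compiles inside an Objective-C file. This is only an illustrative sketch under stated assumptions: `first_code_point` is a hypothetical helper, not a Foundation API, and it assumes the UTF-16 units have already been copied out of the string (for example with `getCharacters:range:`).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a Foundation API): decode the code point that
   starts at units[0] in a UTF-16 buffer holding `length` units.
   An unpaired surrogate is returned unchanged. */
uint32_t first_code_point(const uint16_t *units, size_t length)
{
    if (length == 0) {
        return 0;  /* nothing to decode; callers should check length first */
    }
    uint32_t high = units[0];
    /* High (lead) surrogates occupy 0xD800..0xDBFF. */
    if (high >= 0xD800 && high <= 0xDBFF && length >= 2) {
        uint32_t low = units[1];
        /* Low (trail) surrogates occupy 0xDC00..0xDFFF. */
        if (low >= 0xDC00 && low <= 0xDFFF) {
            /* Each half carries 10 bits of (code point - 0x10000). */
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return high;  /* BMP character or unpaired surrogate */
}
```

Given the units 0xD83D, 0xDE00 (an arbitrary example pair), this yields 0x1F600. A lone 0xD842 comes back unchanged, which is what `characterAtIndex:` reports in the snippet above.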

Now, this code works 100% of the time, but it's ridiculously verbose:

NSString *myString = @"";
uint32_t codepoint;
// Convert the first two UTF-16 units (one surrogate pair) into one UTF-32 value.
[myString getBytes:&codepoint maxLength:4 usedLength:NULL
    encoding:NSUTF32StringEncoding options:0
    range:NSMakeRange(0,2) remainingRange:NULL];
printf("%04x\n", codepoint);  // prints "20d20"

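And this code using mbtowc works, but it's still pretty verbose, affects global state, isn't thread-safe, and probably fills up the autorelease pool on top of all that: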

setlocale(LC_CTYPE, "UTF-8");
wchar_t codepoint;
mbtowc(&codepoint, [@"" UTF8String], 16);
printf("%04x\n", codepoint);  // prints "20d20"

Is there any simple Cocoa/Foundation idiom for extracting the first (or Nth) Unicode codepoint from an NSString? Preferably a one-liner that just returns the codepoint?

The answer given in this otherwise excellent summary of Cocoa Unicode support (near the end of the article) is simply "Don't try it. If your input contains surrogate pairs, filter them out or something, because there's no sane way to handle them properly."

hoo*_*oop 5

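A single Unicode code point might be a surrogate pair, but not all language characters are single code points. That is, not all language characters are represented by one or two UTF-16 units. Many characters are represented by a sequence of Unicode code points.

This means that unless you are dealing only with ASCII, you have to think of language characters as substrings, not as Unicode code points at indexes.

To get the substring for the character at index 0: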
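
To make the distinction concrete, here is a minimal sketch in plain C (which also compiles as Objective-C). `count_code_points` is a hypothetical helper, not a Cocoa API, and it assumes the UTF-16 units have already been copied out of the string. It shows only that decoding surrogate pairs still leaves several code points per visible character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the code points in a UTF-16 buffer.
   A valid high+low surrogate pair counts once; any other unit,
   including an unpaired surrogate, counts once on its own. */
size_t count_code_points(const uint16_t *units, size_t length)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        int is_high = units[i] >= 0xD800 && units[i] <= 0xDBFF;
        if (is_high && i + 1 < length &&
            units[i + 1] >= 0xDC00 && units[i + 1] <= 0xDFFF) {
            i++;  /* skip the low half of the pair */
        }
        count++;
    }
    return count;
}
```

For the units 0x0065, 0x0301 ("e" followed by a combining acute accent) this returns 2, although a reader sees the single character "é". No surrogate pair is involved there at all, so code point extraction alone does not recover what a user would call a character.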

// Range of the full composed character sequence that contains index 0.
NSRange r = [myString rangeOfComposedCharacterSequenceAtIndex:0];
NSString *firstCharacter = [myString substringWithRange:r];

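Depending on what you are actually hoping to do, this may not be what you want. For example, although this gives you "character boundaries", they will not correspond to cursor insertion points, which are language-specific.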