UTF-16 safe substring in C# .NET

Kos*_*ukh 12 .net c# string unicode xamarin.ios

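I want to get a substring of a given length, say 150. However, I want to make sure I don't cut the string off in the middle of a Unicode character.

For example, see the following code: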

var str = "Hello😀 world!";
var substr = str.Substring(0, 6);

Here `substr` is an invalid string, since the smiley character is cut in half.

Instead, I want a function that works as follows:

var str = "Hello😀 world!";
var substr = str.UnicodeSafeSubstring(0, 6);

where `substr` contains "Hello😀".

For reference, here is how I would do it in Objective-C, using `rangeOfComposedCharacterSequencesForRange`:

NSString* str = @"Hello😀 world!";
// Expand the requested range so it never splits a composed character sequence.
NSRange range = [str rangeOfComposedCharacterSequencesForRange:NSMakeRange(0, 6)];
NSString* substr = [str substringWithRange:range];

What is the equivalent code in C#?
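The failure described above can be checked directly. The following is a minimal sketch, assuming the sample string has an emoji such as U+1F600 immediately after "Hello"; the class and method names (`SurrogateCheck`, `EndsWithSplitPair`) are hypothetical and only serve the illustration.

```csharp
using System;

public static class SurrogateCheck
{
    // Returns true when the last UTF-16 code unit of s is a high surrogate,
    // meaning the string ends with the first half of a surrogate pair.
    public static bool EndsWithSplitPair(string s)
    {
        return s.Length > 0 && char.IsHighSurrogate(s[s.Length - 1]);
    }
}
```

With `var str = "Hello\U0001F600 world!";`, `SurrogateCheck.EndsWithSplitPair(str.Substring(0, 6))` is true because index 5 holds the high surrogate of the emoji, while the same check on `str.Substring(0, 7)` is false because the pair is complete.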

Luc*_*ski 6

It looks like you want to split the string on graphemes, that is, on single displayed characters.

In that case, there is a handy method, `StringInfo.SubstringByTextElements`:

var str = "Hello😀 world!";
var substr = new StringInfo(str).SubstringByTextElements(0, 6);

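  • The only thing to remember is that both the `0` and the `6` are in text element units, not chars... If `str == "😀😀😀😀😀😀"` (each grapheme is 2 chars), `substr` will be `"😀😀😀😀😀😀"`, so `substr.Length == 12` (2 upvotes)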
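To make the difference between text element units and `char` units concrete, here is a small sketch. The helper name `FirstTextElements` is hypothetical; it clamps the requested count because `SubstringByTextElements` throws when the range exceeds the number of text elements.

```csharp
using System;
using System.Globalization;

public static class TextElementUnits
{
    // Returns the first `count` text elements of s (fewer if s is shorter).
    // The count is measured in text elements, so the result may contain
    // more than `count` chars.
    public static string FirstTextElements(string s, int count)
    {
        var info = new StringInfo(s);
        int take = Math.Min(count, info.LengthInTextElements);

        // Zero or negative counts, and empty strings, yield an empty result.
        return take <= 0 ? string.Empty : info.SubstringByTextElements(0, take);
    }
}
```

For a string of six U+1F600 emoji, `FirstTextElements(s, 6)` returns all six, and the result has `Length == 12`. For `"Hello\U0001F600 world!"`, the same call returns "Hello" followed by the emoji, which is 7 chars.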

xan*_*tos 6

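This should return the maximal substring that starts at index `startIndex`, is at most `length` chars long, and consists only of "complete" graphemes. So initial or final "split" surrogate pairs will be removed, initial combining marks will be removed, and a final character that would lose its combining marks will be removed.

Note that this is probably not what you asked for. You seem to want graphemes as the unit of measure (or you want to include the last grapheme even if it goes past the `length` parameter).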

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        // Exclusive end index (in chars) of the requested range.
        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // This grapheme would extend past the requested range: drop it.
            if (startIndex > end)
            {
                break;
            }

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);

            // The requested range has been filled exactly.
            if (startIndex == end)
            {
                break;
            }
        }

        return sb.ToString();
    }
}

A variant that simply includes "extra" chars at the end of the substring when that is needed to complete a whole grapheme:

using System;
using System.Globalization;
using System.Text;

public static class StringEx
{
    public static string UnicodeSafeSubstring(this string str, int startIndex, int length)
    {
        if (str == null)
        {
            throw new ArgumentNullException("str");
        }

        if (startIndex < 0 || startIndex > str.Length)
        {
            throw new ArgumentOutOfRangeException("startIndex");
        }

        if (length < 0)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (startIndex + length > str.Length)
        {
            throw new ArgumentOutOfRangeException("length");
        }

        if (length == 0)
        {
            return string.Empty;
        }

        var sb = new StringBuilder(length);

        int end = startIndex + length;

        var enumerator = StringInfo.GetTextElementEnumerator(str, startIndex);

        while (enumerator.MoveNext())
        {
            if (startIndex >= end)
            {
                break;
            }

            string grapheme = enumerator.GetTextElement();
            startIndex += grapheme.Length;

            // Skip initial Low Surrogates/Combining Marks
            if (sb.Length == 0)
            {
                if (char.IsLowSurrogate(grapheme[0]))
                {
                    continue;
                }

                UnicodeCategory cat = char.GetUnicodeCategory(grapheme, 0);

                if (cat == UnicodeCategory.NonSpacingMark || cat == UnicodeCategory.SpacingCombiningMark || cat == UnicodeCategory.EnclosingMark)
                {
                    continue;
                }
            }

            sb.Append(grapheme);
        }

        return sb.ToString();
    }
}

This returns what you asked for: `"Hello😀 world!".UnicodeSafeSubstring(0, 6) == "Hello😀"`.

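  • @ubarar Because they are "incomplete": a surrogate pair is made of a high surrogate followed by a low surrogate, so a string that begins with a low surrogate is invalid. Combining marks are similar: they are, for example, diacritics placed *after* a character (imagine 'a' + '`')... so a combining mark as the first char is useless, because there is nothing before it to combine with (2 upvotes)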
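The two cases from this comment can be tested in isolation. This is a sketch under the same assumptions as the answer's skip logic; the names `LeadingFragment` and `StartsWithFragment` are hypothetical.

```csharp
using System;
using System.Globalization;

public static class LeadingFragment
{
    // Returns true when s begins with something that cannot stand alone:
    // a low surrogate (second half of a pair) or a combining mark.
    public static bool StartsWithFragment(string s)
    {
        if (string.IsNullOrEmpty(s))
        {
            return false;
        }

        if (char.IsLowSurrogate(s[0]))
        {
            return true;
        }

        UnicodeCategory cat = char.GetUnicodeCategory(s, 0);

        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }
}
```

For instance, `"a\u0300".Substring(1)` starts with U+0300 (combining grave accent, a non-spacing mark) and `"\U0001F600".Substring(1)` starts with a low surrogate, so both are reported as fragments, while `"a\u0300"` itself is not.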