UTF-8 substring for Latin, Chinese, Cyrillic, etc.

Question

On Windows Phone, I want to truncate any given string to a length equivalent to 100 ASCII characters.

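String.Length is obviously no use here, because in UTF-8 a Chinese string uses 3 bytes per character, a Danish string uses 1 or 2 bytes per character, and Russian (Cyrillic) letters use 2 bytes each.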

The only encodings available are UTF-8 and UTF-16. So what should I do?

The idea is something like this:

private static string UnicodeSubstring(string text, int length)
{
    var bytes = Encoding.UTF8.GetBytes(text);

    return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
}

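But the cut has to fall on a character boundary, taking into account the number of bytes each character uses, so that the last character always renders correctly.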

Answer 1

Jon*_*eet 6

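One option is simply to walk through the string, counting the number of bytes for each character.

If you know you don't need to handle characters outside the BMP, this is straightforward: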

public string SubstringWithinUtf8Limit(string text, int byteLimit)
{
    int byteCount = 0;
    char[] buffer = new char[1];
    for (int i = 0; i < text.Length; i++)
    {
        buffer[0] = text[i];
        byteCount += Encoding.UTF8.GetByteCount(buffer);
        if (byteCount > byteLimit)
        {
            // This character would exceed the limit; return everything before it
            return text.Substring(0, i);
        }
    }
    return text;
}

If you have to handle surrogate pairs, it gets a bit trickier :(
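For what it's worth, here is a minimal sketch of how the surrogate-pair case might be handled. It is not from the original answer: the class name `Utf8Truncation` is hypothetical, and it assumes `char.IsHighSurrogate` and `char.IsLowSurrogate` are available on the target platform. The idea is to treat a valid high/low surrogate pair as one unit, so the cut never lands between its two halves.

```csharp
using System;
using System.Text;

public static class Utf8Truncation // hypothetical name, for illustration only
{
    // Returns the longest prefix of text whose UTF-8 encoding fits in
    // byteLimit bytes, without splitting a surrogate pair.
    public static string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        int byteCount = 0;
        int i = 0;
        while (i < text.Length)
        {
            // A valid surrogate pair is two UTF-16 chars encoding one code point.
            bool isPair = char.IsHighSurrogate(text[i])
                          && i + 1 < text.Length
                          && char.IsLowSurrogate(text[i + 1]);
            int charCount = isPair ? 2 : 1;

            // Count the bytes for the whole code point at once.
            byteCount += Encoding.UTF8.GetByteCount(text.ToCharArray(i, charCount));
            if (byteCount > byteLimit)
            {
                // This code point would exceed the limit; return everything before it.
                return text.Substring(0, i);
            }
            i += charCount;
        }
        return text;
    }
}
```

For example, `Utf8Truncation.SubstringWithinUtf8Limit("中文", 5)` returns `"中"`, since each of the two characters needs 3 bytes. Note that this only keeps code points whole; a base character followed by combining marks can still be separated from them.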
