C#: Different string encoding for attributes vs. constants

Question

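I am writing tests for a function that is meant to remove invalid code points (for example, orphaned surrogates). However, depending on how I write the test, the surrogate ends up encoded differently.

This version of the test passes: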

        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrhpanedSurrogatePair()
        {
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

This one does not:

        [TestCategory("UnitTest")]
        [TestMethod]
        public void RemoveOrhpanedSurrogatePair()
        {
            var input = "\uDDDD1975";
            var cleanText = input.ReplaceInvalidCodePoints();

            Assert.AreEqual(input.Length - 1, cleanText.Length);
            Assert.AreEqual("1975", cleanText);
        }

Looking at the debugger, the first variant encodes the string as "\uDDDD1975", but the second variant produces "??1975": two valid characters instead of one orphaned surrogate.
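The `ReplaceInvalidCodePoints` extension under test is not shown in the question. As a minimal sketch only, assuming it simply drops unpaired surrogates and keeps everything else (the class name `StringCleaningExtensions` and the method body are hypothetical, not the asker's code), it might look like this:

```csharp
using System;
using System.Text;

public static class StringCleaningExtensions
{
    // Hypothetical stand-in for the asker's method: drops unpaired surrogates.
    public static string ReplaceInvalidCodePoints(this string text)
    {
        var result = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A high surrogate followed by a low surrogate is a valid pair: keep both.
                result.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (!char.IsSurrogate(c))
            {
                // Ordinary BMP character (this includes U+FFFD): keep it.
                result.Append(c);
            }
            // Anything else is a lone surrogate and is skipped.
        }
        return result.ToString();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Lone low surrogate followed by four digits: the surrogate is dropped.
        Console.WriteLine("\uDDDD1975".ReplaceInvalidCodePoints());              // prints "1975"
        // Two replacement characters are valid code points, so nothing is removed.
        Console.WriteLine("\uFFFD\uFFFD1975".ReplaceInvalidCodePoints().Length); // prints 6
    }
}
```

Under that assumption, an input that arrives as two U+FFFD characters plus "1975" has length 6 and is returned unchanged, so an assertion expecting `input.Length - 1` cannot hold.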

Answer 1

Ass*_*ael 5

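I think a clue to the answer can be found in (where else) a blog post by @jonskeet. Apparently, C# encodes strings as UTF-16 everywhere except in attribute constructors, which use UTF-8. The compiler seems to see that this is an orphaned surrogate and, going through its UTF-8 value, treats it as two invalid Unicode characters. These are then replaced by a pair of \uFFFD characters (the Unicode replacement character, used to indicate corrupted data when decoding binary into text).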

using System;
using System.ComponentModel;
using System.Linq;

[Description(Value)]
class Test
{
    const string Value = "\uDDDD";

    static void Main()
    {
        // Read the same constant back through the attribute's metadata.
        var description = (DescriptionAttribute)
            typeof(Test).GetCustomAttributes(typeof(DescriptionAttribute), true)[0];
        DumpString("Attribute", description.Description);
        DumpString("Constant", Value);
    }

    // Prints each UTF-16 code unit of the string as four hex digits.
    static void DumpString(string name, string text)
    {
        var utf16 = text.Select(c => ((uint) c).ToString("x4"));
        Console.WriteLine("{0}: {1}", name, string.Join(" ", utf16));
    }
}

This produces:

Attribute: fffd fffd
Constant: dddd
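The UTF-8 round trip can be tried in isolation. The sketch below assumes, as the explanation above suggests, that the lone surrogate is written out with the ordinary three-byte UTF-8 bit pattern and later decoded by a strict decoder. The class `LoneSurrogateUtf8` and its helper methods are made up for illustration and are not what the compiler or runtime actually call.

```csharp
using System;
using System.Linq;
using System.Text;

public static class LoneSurrogateUtf8
{
    // Applies the generic three-byte UTF-8 bit pattern to one UTF-16 code unit
    // (U+0800..U+FFFF) without checking whether it is a surrogate.
    // Assumption: this mimics how the attribute argument ends up in metadata.
    public static byte[] EncodeCodeUnitLeniently(char c)
    {
        return new[]
        {
            (byte)(0xE0 | (c >> 12)),
            (byte)(0x80 | ((c >> 6) & 0x3F)),
            (byte)(0x80 | (c & 0x3F)),
        };
    }

    // Encoding.UTF8 substitutes U+FFFD for byte sequences that are not valid UTF-8.
    public static string DecodeStrictly(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    public static void Main()
    {
        byte[] bytes = EncodeCodeUnitLeniently('\uDDDD');
        Console.WriteLine(BitConverter.ToString(bytes)); // prints "ED-B7-9D"

        string decoded = DecodeStrictly(bytes);
        // Dump the decoded UTF-16 code units as hex, like DumpString above.
        Console.WriteLine(string.Join(" ", decoded.Select(ch => ((int)ch).ToString("x4"))));
    }
}
```

The decoded string contains only U+FFFD and no trace of the original U+DDDD. How many replacement characters come out of the three invalid bytes depends on the decoder, so the exact count may differ between runtimes.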

