rem*_*mio 7 c# string unicode utf-8
I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.
For now, my implementation is the following:
public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);
    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);
    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}
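I am playing with the word "déjà". I converted it to UTF-8 through this online tool, so I started testing my method with the string "dÃ©jÃ ".
Unfortunately, with this implementation the string just stays the same.
Where am I going wrong?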
bam*_*s53 16
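So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.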
public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i = 0; i < utf8String.Length; ++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }
    return Encoding.UTF8.GetString(utf8Bytes, 0, utf8Bytes.Length);
}
DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà
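That's easy, but it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length).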
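To make the root cause concrete, here is a minimal sketch that reproduces the mistake on purpose. It assumes the wrong decoder was ISO-8859-1 (Latin-1), which maps every byte to the code point of the same value; the real culprit in your code may be a different encoding, and the class and method names are made up for this example.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: decodes UTF-8 bytes with the WRONG encoding
    // (ISO-8859-1 is an assumption here), producing the mangled string.
    public static string Mangle(string text)
    {
        // Correct step: the text is serialized as UTF-8 bytes.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);

        // Incorrect step: each byte is turned into one 16-bit char.
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        return latin1.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.Mangle("déjà") should give the same six code units as the "d\u00C3\u00A9j\u00C3\u00A0" literal passed to DecodeFromUtf8.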
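Alternatively, if you're sure you know the incorrect encoding that was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single-byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then do the correct conversion from UTF-8 bytes: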
public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}
UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8);
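Since this only works when the round trip is lossless, it may be worth failing loudly when it is not. The following is a sketch of a strict variant, not a drop-in replacement: the names StrictEncodingRepair and UndoStrict are hypothetical, and the example uses ISO-8859-1 as the assumed mistake because it is available without extra setup (on .NET Core and later, code page 1252 needs the System.Text.Encoding.CodePages provider to be registered first).

```csharp
using System;
using System.Text;

public static class StrictEncodingRepair
{
    // Hypothetical strict variant of UndoEncodingMistake: it throws instead of
    // silently substituting '?' or U+FFFD when the repair cannot be exact.
    public static string UndoStrict(string mangled, string mistakeName)
    {
        // Throws EncoderFallbackException if a char has no byte in `mistake`.
        Encoding mistake = Encoding.GetEncoding(
            mistakeName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);

        // Throws DecoderFallbackException if the bytes are not valid UTF-8.
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        // The inverse of `mistake.GetString(originalBytes)`.
        byte[] originalBytes = mistake.GetBytes(mangled);
        return strictUtf8.GetString(originalBytes);
    }
}
```

With this sketch, UndoStrict("d\u00C3\u00A9j\u00C3\u00A0", "iso-8859-1") recovers "déjà", while a string that was never mangled, such as "é" on its own, raises a DecoderFallbackException rather than returning garbage.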
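"I have a string that displays UTF-8 encoded characters"
There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode code point. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.
Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.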
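As a small illustration of that advice, the sketch below keeps the UTF-8 data as a byte[] from the start and decodes it exactly once. The class name and the sample bytes are assumptions for this example; in practice the bytes would come from a file, socket, or native call.

```csharp
using System;
using System.Text;

public static class Utf8BytesDemo
{
    // Sample data (an assumption for this example): "déjà" encoded as UTF-8.
    public static readonly byte[] Sample = { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 };

    // Decodes UTF-8 data that was kept as bytes the whole time,
    // so no intermediate (mangled) string ever exists.
    public static string Decode(byte[] utf8Bytes)
    {
        return new UTF8Encoding().GetString(utf8Bytes);
    }
}
```

Utf8BytesDemo.Decode(Utf8BytesDemo.Sample) should return "déjà" directly, with no repair step needed.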
小智 9
If you have a UTF-8 string where every byte is correct ('Ö' -> [195, 0], [150, 0]), you can use the following:
public static string Utf8ToUtf16(string utf8String)
{
    /*
     * Every .NET string stores text in the UTF-16 encoding, known as
     * Encoding.Unicode. Other encodings may exist as a byte array or
     * be stored incorrectly inside a UTF-16 string.
     *
     * UTF-8 = 1 to 4 bytes per char
     *   ["100" for the ANSI 'd']
     *   ["206" and "186" for the Greek 'κ']
     *
     * UTF-16 = 2 bytes per char (4 for surrogate pairs)
     *   ["100, 0" for the ANSI 'd']
     *   ["186, 3" for the Greek 'κ']
     *
     * UTF-8 inside UTF-16
     *   ["100, 0" for the ANSI 'd']
     *   ["206, 0" and "186, 0" for the Greek 'κ']
     *
     * First we need to get the UTF-8 byte array and remove all
     * 0 bytes (binary 0) while doing so.
     *
     * Binary 0 means end of string in a C-style UTF-8 string, while in
     * UTF-16 a single binary 0 does not end the string; only two
     * binary 0 bytes in a row do. .NET handles this for us.
     *
     * After removing the binary 0 bytes and receiving the byte array,
     * we can use the UTF-8 decoder to get a UTF-16 string.
     */

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }
    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}
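In my case the DLL result is also a UTF-8 string, but unfortunately the UTF-8 string was interpreted with the UTF-16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–', which is 150, was converted to the UTF-16 '–', which is 8211. If you have this case too, you can use the following instead: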
public static string Utf8ToUtf16(string utf8String)
{
    // Get UTF-8 bytes by reading each byte with ANSI encoding.
    // Note: this relies on Encoding.Default being the ANSI code page (.NET Framework);
    // on .NET Core and later, Encoding.Default is always UTF-8.
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);
    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);
    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}
Or the native method:
[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);
public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);
        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}
If you need it the other way around, see Utf16ToUtf8. Hope I could help.
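The linked Utf16ToUtf8 is not reproduced here. As a rough idea of what the reverse direction could look like, here is a hypothetical sketch (my assumption, not the linked code) that produces the "UTF-8 inside UTF-16" layout described above, i.e. 'Ö' -> [195, 0], [150, 0]. The class name ReverseSketch is invented for this example.

```csharp
using System;
using System.Text;

public static class ReverseSketch
{
    // Hypothetical reverse of the first Utf8ToUtf16 variant:
    // encodes the text as UTF-8 and stores each byte in its own 16-bit char.
    public static string Utf16ToUtf8(string utf16String)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(utf16String);
        StringBuilder result = new StringBuilder(utf8Bytes.Length);
        foreach (byte utf8Byte in utf8Bytes)
        {
            // Widen the byte to a char; the high byte of the char stays 0.
            result.Append((char)utf8Byte);
        }
        return result.ToString();
    }
}
```

For 'Ö' (U+00D6) this should yield the two chars U+00C3 and U+0096, matching the [195, 0], [150, 0] layout.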
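What you have appears to be a string incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here's how to reverse it, assuming no other loss. One loss that is not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.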
using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "dÃ©jÃ\xa0"; // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}
Result:
déjà