How do I re-encode a UTF-16 byte array as UTF-8?

Question

I have a UTF-16 byte array (&[u8]) that I want to decode and re-encode as UTF-8 in Rust.

In Python I can do this:

array.decode('UTF-16', errors='ignore').encode('UTF-8')

How can I do this in Rust?

Answer 1

Fin*_*nis 5

The problem here is that UTF-16 is defined in terms of 16-bit units, and it does not specify how to convert two 8-bit units (that is, bytes) into one 16-bit unit.

For that reason, I assume you are using network byte order (that is, big endian). Note that this might be incorrect, because x86 processors use little endian.
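If the byte order is not known up front, one option is to look for a byte order mark (U+FEFF) at the start of the data and fall back to a default otherwise. The following is only a sketch of that idea; the `Endian` enum and the helper names `detect_bom` and `decode_utf16_bytes` are made up for illustration and are not part of the standard library.

```rust
/// Byte order of the UTF-16 input (hypothetical helper type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Endian {
    Big,
    Little,
}

/// Looks for a byte order mark at the start of `data`.
/// Returns the detected order and the data after the BOM,
/// or `default` and the unchanged data when no BOM is present.
fn detect_bom(data: &[u8], default: Endian) -> (Endian, &[u8]) {
    match data {
        [0xFE, 0xFF, rest @ ..] => (Endian::Big, rest),
        [0xFF, 0xFE, rest @ ..] => (Endian::Little, rest),
        _ => (default, data),
    }
}

/// Decodes UTF-16 bytes lossily. A trailing odd byte is silently dropped.
fn decode_utf16_bytes(data: &[u8], default: Endian) -> String {
    let (endian, payload) = detect_bom(data, default);
    let units: Vec<u16> = payload
        .chunks_exact(2)
        .map(|pair| match endian {
            Endian::Big => u16::from_be_bytes([pair[0], pair[1]]),
            Endian::Little => u16::from_le_bytes([pair[0], pair[1]]),
        })
        .collect();
    String::from_utf16_lossy(&units)
}

fn main() {
    // The same two characters, once big endian and once little endian, each with a BOM.
    let be: [u8; 6] = [0xFE, 0xFF, 0x00, 0x48, 0x20, 0xAC];
    let le: [u8; 6] = [0xFF, 0xFE, 0x48, 0x00, 0xAC, 0x20];
    println!("{:?}", decode_utf16_bytes(&be, Endian::Big));
    println!("{:?}", decode_utf16_bytes(&le, Endian::Big));
}
```

Whether big or little endian is the right default depends on where the data comes from, so the default is left as a parameter here.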

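So the important first step is to convert the u8s into u16s. In this case I iterate over them in pairs, convert each pair with u16::from_be_bytes(), and collect the results in a vector.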

Then we can use String::from_utf16() or String::from_utf16_lossy() to convert the Vec<u16> into a String.
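The two functions differ in how they treat invalid input such as an unpaired surrogate: `String::from_utf16()` returns an error, while `String::from_utf16_lossy()` substitutes U+FFFD. Neither drops the bad unit the way Python's `errors='ignore'` does. A rough sketch of a dropping variant, using `std::char::decode_utf16`, could look like this; the helper name `from_utf16_skip_invalid` is hypothetical.

```rust
/// Hypothetical helper: decodes UTF-16 code units and drops anything invalid
/// (for example unpaired surrogates) instead of replacing it.
fn from_utf16_skip_invalid(units: &[u16]) -> String {
    std::char::decode_utf16(units.iter().copied())
        .filter_map(Result::ok) // keep only successfully decoded chars
        .collect()
}

fn main() {
    // 'H', an unpaired high surrogate (0xD800), 'i'
    let units: [u16; 3] = [0x0048, 0xD800, 0x0069];

    // Strict: the unpaired surrogate makes the whole conversion fail.
    println!("{}", String::from_utf16(&units).is_err());

    // Lossy: the unpaired surrogate becomes U+FFFD.
    println!("{}", String::from_utf16_lossy(&units) == "H\u{FFFD}i");

    // Skipping: the unpaired surrogate is dropped.
    println!("{:?}", from_utf16_skip_invalid(&units));
}
```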

Strings are internally represented as UTF-8 in Rust, so we can extract the UTF-8 representation directly via .as_bytes() or .into_bytes().
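The difference between the two is ownership: `.as_bytes()` borrows the bytes from the `String`, while `.into_bytes()` consumes the `String` and hands back an owned `Vec<u8>`. A small sketch, with a hypothetical helper `utf16_units_to_utf8` that is not part of the standard library:

```rust
/// Hypothetical helper: UTF-16 code units in, owned UTF-8 bytes out.
fn utf16_units_to_utf8(units: &[u16]) -> Vec<u8> {
    // into_bytes() moves the String's buffer out without copying it.
    String::from_utf16_lossy(units).into_bytes()
}

fn main() {
    let s = String::from("H€");

    // Borrowed view: `s` stays usable afterwards.
    let view: &[u8] = s.as_bytes();
    println!("{:?}", view);

    // Owned bytes: useful when the result has to outlive the String.
    let owned: Vec<u8> = utf16_units_to_utf8(&[0x0048, 0x20AC]);
    println!("{:?}", owned);
}
```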

fn main() {
    let utf16_bytes: &[u8] = &[
        0x00, 0x48, 0x20, 0xAC, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x20, 0x00, 0x77, 0x00,
        0x6f, 0x00, 0x72, 0x00, 0x6c, 0x00, 0x64, 0x00, 0x21,
    ];

    let utf16_packets = utf16_bytes
        .chunks(2)
        .map(|e| u16::from_be_bytes(e.try_into().unwrap()))
        .collect::<Vec<_>>();

    let s = String::from_utf16_lossy(&utf16_packets);
    println!("{:?}", s);

    let utf8_bytes = s.as_bytes();
    println!("{:?}", utf8_bytes);
}

"H€llo world!"
[72, 226, 130, 172, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]

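Note that we have to use .try_into().unwrap() in our map() closure. This is because .chunks() yields plain slices and does not let the compiler know how big the chunks we iterate over are.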
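On stable Rust, one way to avoid the `try_into().unwrap()` is to use `chunks_exact(2)` and build the two-byte array by indexing into each chunk. This is only a sketch, and the function name `be_bytes_to_units` is made up for illustration. Note that `chunks_exact` silently leaves out a trailing odd byte, so this variant never panics but also never reports that case.

```rust
/// Hypothetical helper: pairs up bytes as big-endian UTF-16 code units.
/// A trailing odd byte is ignored, because chunks_exact(2) only yields full pairs.
fn be_bytes_to_units(data: &[u8]) -> Vec<u16> {
    data.chunks_exact(2)
        // Every chunk has exactly two elements, so indexing cannot go out of bounds.
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
        .collect()
}

fn main() {
    let utf16_bytes: &[u8] = &[0x00, 0x48, 0x20, 0xAC];
    let units = be_bytes_to_units(utf16_bytes);
    println!("{:?}", String::from_utf16_lossy(&units));
}
```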

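Once it is stabilized, there is an array_chunks() method that does let the compiler know, which would make the code shorter and faster. Sadly, at the time of writing it is only available on nightly.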

#![feature(array_chunks)]

fn main() {
    let utf16_bytes: &[u8] = &[
        0x00, 0x48, 0x20, 0xAC, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x20, 0x00, 0x77, 0x00,
        0x6f, 0x00, 0x72, 0x00, 0x6c, 0x00, 0x64, 0x00, 0x21,
    ];

    let utf16_packets = utf16_bytes
        .array_chunks()
        .cloned()
        .map(u16::from_be_bytes)
        .collect::<Vec<_>>();

    let s = String::from_utf16_lossy(&utf16_packets);
    println!("{:?}", s);

    let utf8_bytes = s.as_bytes();
    println!("{:?}", utf8_bytes);
}

> cargo +nightly run
"H€llo world!"
[72, 226, 130, 172, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]

This assumes that our input converts completely into u16 units. In production code, checking for an odd number of bytes is advisable.
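A minimal version of such a check, using only the standard library, might look like the sketch below. The function name `checked_be_units` and the choice of `Option` as the return type are assumptions made for illustration.

```rust
/// Hypothetical helper: returns None when `data` has an odd number of bytes,
/// otherwise the bytes paired up as big-endian UTF-16 code units.
fn checked_be_units(data: &[u8]) -> Option<Vec<u16>> {
    if data.len() % 2 != 0 {
        return None;
    }
    Some(
        data.chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect(),
    )
}

fn main() {
    // Even length: conversion succeeds.
    println!("{:?}", checked_be_units(&[0x00, 0x48, 0x20, 0xAC]));
    // Odd length: the caller is told instead of losing a byte silently.
    println!("{:?}", checked_be_units(&[0x00, 0x48, 0x20]));
}
```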

To write this properly with error handling, I would extract it into a function and propagate the errors:

use thiserror::Error;

#[derive(Error, Debug)]
enum ParseUTF16Error {
    #[error("UTF-16 data needs to contain an even amount of bytes")]
    UnevenByteCount,
    #[error("The given data does not contain valid UTF16 data")]
    InvalidContent,
}

fn parse_utf16(data: &[u8]) -> Result<String, ParseUTF16Error> {
    let data16 = data
        .chunks(2)
        .map(|e| e.try_into().map(u16::from_be_bytes))
        .collect::<Result<Vec<_>, _>>()
        .map_err(|_| ParseUTF16Error::UnevenByteCount)?;

    String::from_utf16(&data16).map_err(|_| ParseUTF16Error::InvalidContent)
}

fn main() {
    let utf16_bytes: &[u8] = &[
        0x00, 0x48, 0x20, 0xAC, 0x00, 0x6c, 0x00, 0x6c, 0x00, 0x6f, 0x00, 0x20, 0x00, 0x77, 0x00,
        0x6f, 0x00, 0x72, 0x00, 0x6c, 0x00, 0x64, 0x00, 0x21,
    ];

    let s = parse_utf16(utf16_bytes).unwrap();
    println!("{:?}", s);

    let utf8_bytes = s.as_bytes();
    println!("{:?}", utf8_bytes);
}

"H€llo world!"
[72, 226, 130, 172, 108, 108, 111, 32, 119, 111, 114, 108, 100, 33]
