是否可以将字节解码为UTF-8,将错误转换为Rust中的转义序列？

Question

是否可以将字节解码为UTF-8,将错误转换为Rust中的转义序列？

在Rust中,通过这样做可以从字节中获取UTF-8:

if let Ok(s) = str::from_utf8(some_u8_slice) {
    println!("example {}", s);
}

Run Code Online (Sandbox Code Playgroud)

这可能有效,也可能没有,但Python有能力处理错误,例如:

s = some_bytes.decode(encoding='utf-8', errors='surrogateescape');

Run Code Online (Sandbox Code Playgroud)

在此示例中,参数surrogateescape将无效的utf-8序列转换为转义码,因此它们不是忽略或替换无法解码的文本,而是替换为有效的字节文字表达式utf-8.请参阅:Python文档了解详细信息.

Rust是否有办法从字节中获取UTF-8字符串,从而逃避错误而不是完全失败？

Answer 1

She*_*ter 5

是的,通过String::from_utf8_lossy:

fn main() {
    let text = [104, 101, 0xFF, 108, 111];
    let s = String::from_utf8_lossy(&text);
    println!("{}", s); // he?lo
}

Run Code Online (Sandbox Code Playgroud)

如果您需要对该过程进行更多控制,可以std::str::from_utf8按照其他答案的建议使用.但是,没有理由按照它的建议对字节进行双重验证.

一个快速被黑客攻击的例子:

use std::str;

fn example(mut bytes: &[u8]) -> String {
    let mut output = String::new();

    loop {
        match str::from_utf8(bytes) {
            Ok(s) => {
                // The entire rest of the string was valid UTF-8, we are done
                output.push_str(s);
                return output;
            }
            Err(e) => {
                let (good, bad) = bytes.split_at(e.valid_up_to());

                if !good.is_empty() {
                    let s = unsafe {
                        // This is safe because we have already validated this
                        // UTF-8 data via the call to `str::from_utf8`; there's
                        // no need to check it a second time
                        str::from_utf8_unchecked(good)
                    };
                    output.push_str(s);
                }

                if bad.is_empty() {
                    //  No more data left
                    return output;
                }

                // Do whatever type of recovery you need to here
                output.push_str("<badbyte>");

                // Skip the bad byte and try again
                bytes = &bad[1..];
            }
        }
    }
}

fn main() {
    let r = example(&[104, 101, 0xFF, 108, 111]);
    println!("{}", r); // he<badbyte>lo
}

Run Code Online (Sandbox Code Playgroud)

您可以扩展它以获取值来替换坏字节,使用闭包来处理坏字节等.例如:

fn example(mut bytes: &[u8], handler: impl Fn(&mut String, &[u8])) -> String {
    // ...    
                handler(&mut output, bad);
    // ...
}

Run Code Online (Sandbox Code Playgroud)

let r = example(&[104, 101, 0xFF, 108, 111], |output, bytes| {
    use std::fmt::Write;
    write!(output, "\\U{{{}}}", bytes[0]).unwrap()
});
println!("{}", r); // he\U{255}lo

Run Code Online (Sandbox Code Playgroud)

也可以看看:

请注意，`from_utf8_lossy` 不像 Python 那样提供不同的错误处理方式。而不是转义，无效的 utf-8 序列被替换为 `U+FFFD`（匹配 Python 的 `replace` 行为）。所以我认为这个问题的简短答案是**不**，尽管它仍然值得一提`from_utf8_lossy`。 (3认同)
`from_utf8_lossy`状态的文档:*"在此转换期间,from_utf8_lossy()将用U + FFFD REPLACEMENT CHARACTER替换任何无效的UTF-8序列,如下所示: "*.所以这是一个替代品,而不是逃避序列.这个答案的第一部分显示了如何使用转义序列进行转换:http://stackoverflow.com/a/41450295/432509 (2认同)

归档时间：	9 年，2 月前
查看次数：	1765 次
最近记录：	7 年，5 月前