Storing arbitrary binary data on a system that only accepts valid UTF-8

Question


N. *_*cA. 5 python unicode utf-8 python-2.7

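I have arbitrary binary data. I need to store it in a system that expects valid UTF-8. It will never be interpreted as text; I just need to put it there, retrieve it later, and reconstitute my binary data.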

Obviously base64 would work, but I can't afford that much inflation.

How can I easily achieve this in Python 2.7?

Answer 1

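You'll have to express your data using only ASCII characters. For making binary data fit into printable text (which is also UTF-8 safe), Base64 is the most efficient method available in the Python standard library. Sure, it requires 33% more space to express the same data, but the other methods take even more additional space.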
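To put numbers on the 33% figure, here is a small sketch that compares Base64 with hex encoding, the other printable encoding in the standard library that most people reach for. The sample payload is an arbitrary assumption chosen so the sizes come out round.

```python
import base64
import binascii

# Arbitrary sample payload: 3000 bytes, including non-ASCII byte values.
data = b'\x00\xff\x10' * 1000

b64 = base64.b64encode(data)    # 4 output characters per 3 input bytes
hexed = binascii.hexlify(data)  # 2 output characters per input byte

print('raw:    %d bytes' % len(data))
print('base64: %d bytes (+%.0f%%)' % (len(b64), 100.0 * (len(b64) - len(data)) / len(data)))
print('hex:    %d bytes (+%.0f%%)' % (len(hexed), 100.0 * (len(hexed) - len(data)) / len(data)))
```

Both outputs consist of ASCII characters only, so they are valid UTF-8 as well.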

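You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the compressed data is smaller.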

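import zlib
import base64

def pack_utf8_safe(data):
    """Return an ASCII-only (and therefore UTF-8 safe) encoding of data."""
    is_compressed = False
    compressed = zlib.compress(data)
    # Only keep the compressed form if it still wins after the
    # one-byte marker that is prepended below.
    if len(compressed) < (len(data) - 1):
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        # b'.' is the same as '.' in Python 2.7
        base64_encoded = b'.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    """Reverse pack_utf8_safe() and return the original binary data."""
    decompress = False
    # A leading '.' marks a payload that was compressed before encoding.
    if base64_encoded.startswith(b'.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data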


The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.
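As a rough illustration of why the compression step is optional, the sketch below compresses two made-up payloads: one highly repetitive, one too short to benefit. The payloads and the `sizes` dictionary are assumptions for the demonstration only.

```python
import zlib

samples = {
    'repetitive': b'abc' * 1000,  # 3000 redundant bytes, compresses well
    'short': b'\x00\x01',         # zlib's header and checksum outweigh any gain
}

sizes = {}
for name, data in samples.items():
    compressed = zlib.compress(data)
    sizes[name] = (len(data), len(compressed))
    print('%s: %d -> %d bytes' % (name, len(data), len(compressed)))
```

For the short payload the "compressed" form is larger than the input, which is the case where the data should be stored uncompressed and unmarked.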

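You could further shave off the 1 or 2 = padding characters from the end of the Base64-encoded data; these can then be re-added when decoding (add '=' * (-len(encoded) % 4) to the end), but I'm not sure that's worth the bother.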
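The padding trick can be sketched as a pair of helpers. The function names `b64encode_nopad` and `b64decode_nopad` are made up for this illustration and are not part of the base64 module.

```python
import base64

def b64encode_nopad(data):
    """Base64-encode data and drop the trailing '=' padding (hypothetical helper)."""
    return base64.b64encode(data).rstrip(b'=')

def b64decode_nopad(encoded):
    """Restore the padding removed by b64encode_nopad(), then decode."""
    # -len % 4 yields 0, 1 or 2 here: the number of '=' needed to reach a multiple of 4.
    return base64.b64decode(encoded + b'=' * (-len(encoded) % 4))

encoded = b64encode_nopad(b'a')   # full encoding would be b'YQ=='
print(encoded)
print(b64decode_nopad(encoded))
```

This saves at most two characters per value, so it only matters when many small values are stored.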

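You can achieve further savings by switching to Base85 encoding, an ASCII-safe encoding for binary data with a 4-to-5 ratio (5 characters for every 4 bytes), so a 25% overhead instead of 33%. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 library). In 2.7 you can use the python-mom project: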

from mom.codec import base85


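and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls. Note that some Base85 variants (such as Ascii85, which uses the characters '!' through 'u') include '.' in their alphabet, so verify that the compression marker is not a valid output character of the encoder you choose.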
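For comparison, this is what the same idea looks like with the standard library on Python 3.4 or newer, where base64.b85encode() and base64.b85decode() exist. It does not run on Python 2.7, and the sample payload is an arbitrary assumption.

```python
import base64

# Arbitrary sample payload: 3000 bytes (a multiple of both 3 and 4).
data = b'\x00\xff\x10\x80' * 750

b64 = base64.b64encode(data)  # 4 characters per 3 bytes
b85 = base64.b85encode(data)  # 5 characters per 4 bytes (Python 3.4+)

print('base64: %d bytes' % len(b64))
print('base85: %d bytes' % len(b85))

# Round trip back to the original bytes.
restored = base64.b85decode(b85)
```

The alphabet used by base64.b85encode() does not contain '.', so the marker scheme from the Base64 version carries over to this particular encoder.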

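If you are 100% certain that nothing along the path will treat your data as text (possibly altering line separators, or interpreting and altering other control codes), you could also use Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for you; there is a GitHub-hosted module, but I have not tested it.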
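To make the 8-characters-per-7-bytes idea concrete, here is a minimal hand-rolled sketch. It is not the GitHub module mentioned above, and `b128encode` and `b128decode` are hypothetical names. It repacks the input bit stream into 7-bit values (0 to 127), each of which is a single-byte UTF-8 character. The output can contain NUL and other control characters, which is exactly why the caveat above applies.

```python
def b128encode(data):
    """Repack 8-bit bytes into 7-bit values (hypothetical helper, not a library API)."""
    out = bytearray()
    acc = 0    # bit accumulator
    nbits = 0  # number of valid bits currently held in acc
    for byte in bytearray(data):
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1  # keep only the bits not yet emitted
    if nbits:
        # Left-align the leftover bits in one final 7-bit character.
        out.append((acc << (7 - nbits)) & 0x7F)
    return bytes(out)

def b128decode(encoded):
    """Reverse b128encode(); trailing padding bits (fewer than 8) are dropped."""
    out = bytearray()
    acc = 0
    nbits = 0
    for value in bytearray(encoded):
        acc = (acc << 7) | value
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1
    return bytes(out)

packed = b128encode(b'\xff' * 7)  # 7 bytes in, 8 characters out
print(len(packed))
```

Because n input bytes always produce ceil(8n/7) characters, the decoder can recover the original length without a separate padding marker.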
