Python: split bytes on a hexadecimal delimiter

Question

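I'm working with several binary files and want to parse the UTF-8 strings they contain.

I currently have a function that takes a starting position in the file and returns the string found there: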

def str_extract(file, start, size, delimiter = None, index = None):
   file.seek(start)
   if (delimiter != None and index != None):
       return file.read(size).explode('0x00000000')[index] #incorrect
   else:
       return file.read(size)

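Some strings in the file are separated by 0x00 00 00 00. Is it possible to split on that delimiter the way PHP's explode does? I'm new to Python, so any suggestions for improving the code are welcome.

Sample file:

48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 | 00 00 00 00 | 31 00 32 00 33 00

That is, Hello World123. I have marked the 00 00 00 00 delimiter by enclosing it in | bars.

So: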

str_extract(file, 0x00, 0x20, 0x00000000, 0) => 'Hello World'

Similarly:

str_extract(file, 0x00, 0x20, 0x00000000, 1) => '123'

Answer 1

Mar*_*ers 6

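I'm assuming you are using Python 2 here, but the code is written so that it runs on both Python 2 and Python 3.

You have UTF-16 data, not UTF-8. You can read it as binary data and split on four NUL bytes with str.split() (bytes.split() on Python 3):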

file.read(size).split(b'\x00' * 4)[index]

The resulting data is encoded as UTF-16 little-endian (you may have omitted the UTF-16 BOM at the start); you can decode it with:

result.decode('utf-16-le')

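However, decoding will then fail, because the split has cut off the last NUL byte of the text. The final character is itself followed by a NUL byte, so there are five NUL bytes in a row; Python splits on the first four it finds and does not skip the one that belongs to the text.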
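
To make the misalignment concrete, here is a small sketch (Python 3) that rebuilds the sample bytes from the question and splits them on four NUL bytes. The variable names are illustrative only and not part of the original code.

```python
# Rebuild the sample from the question: UTF-16-LE text, four NUL bytes, more text.
data = 'Hello World'.encode('utf-16-le') + b'\x00' * 4 + '123'.encode('utf-16-le')

# 'd' is encoded as b'd\x00', so there are really five NUL bytes in a row.
# split() consumes the first four, which includes the second byte of 'd'.
parts = data.split(b'\x00' * 4)

print(len(parts[0]))  # an odd length: the last code unit lost its second byte
print(parts[1][:1])   # the leftover NUL byte now leads the second chunk

try:
    parts[0].decode('utf-16-le')
except UnicodeDecodeError as exc:
    print('decode failed:', exc)
```

Both chunks end up with an odd number of bytes, so neither can be decoded as UTF-16.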

A better idea is to decode to Unicode first, and then split on a pair of Unicode NUL code points:

file.read(size).decode('utf-16-le').split(u'\x00' * 2)[index]
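
As a quick check of the decode-first approach, the following sketch (Python 3) applies it to the sample bytes from the question. It assumes the bytes read start on a code-unit boundary and have an even length; the names are illustrative.

```python
# Same sample bytes as in the question.
data = 'Hello World'.encode('utf-16-le') + b'\x00' * 4 + '123'.encode('utf-16-le')

# Decode first: every 2-byte code unit becomes one code point, so the
# 4-byte delimiter turns into exactly two U+0000 characters.
text = data.decode('utf-16-le')
fields = text.split('\x00' * 2)

print(fields)
```

Because the split now happens on whole code points, the NUL byte that belongs to the last character can no longer be mistaken for part of the delimiter.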

Combined into one function, that becomes:

def str_extract(file, start, size, delimiter=None, index=None):
    file.seek(start)
    if delimiter is not None and index is not None:
        # The delimiter arrives as bytes; decode it here, or pass in Unicode
        delimiter = delimiter.decode('utf-16-le')
        return file.read(size).decode('utf-16-le').split(delimiter)[index]
    else:
        return file.read(size).decode('utf-16-le')

with open('filename', 'rb') as fobj:
    result = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)

If the file does have a BOM at the start, consider opening the file as UTF-16 instead:

import io

with io.open('filename', 'r', encoding='utf16') as fobj:
    # ....

and remove the explicit decoding.
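
A minimal sketch of that variant follows. It writes a hypothetical temporary file with a BOM (the 'utf-16' codec adds one when encoding) and reads it back in text mode; the file contents and variable names are assumptions made for illustration.

```python
import io
import os
import tempfile

# Hypothetical sample file: encoding with 'utf-16' prepends a BOM.
payload = u'Hello World\x00\x00123'
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload.encode('utf-16'))
    path = tmp.name

try:
    # The 'utf-16' codec reads the BOM to pick the byte order and strips it.
    # newline='' disables newline translation so the text comes back unchanged.
    with io.open(path, 'r', encoding='utf-16', newline='') as fobj:
        fields = fobj.read().split(u'\x00' * 2)
finally:
    os.remove(path)

print(fields)
```

Note that in text mode read(size) counts characters rather than bytes, so byte offsets and sizes such as 0x20 would need to be adjusted if the function above were reused this way.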

Python 2 demo:

>>> from io import BytesIO
>>> data = b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00\x00\x00\x00\x001\x002\x003\x00'
>>> fobj = BytesIO(data)
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 0)
u'Hello World'
>>> str_extract(fobj, 0, 0x20, '\x00' * 4, 1)
u'123'
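
On Python 3 the delimiter in this demo has to be passed as bytes, because a str such as '\x00' * 4 has no decode() method there. The sketch below is a lightly condensed copy of the function above, repeated only so the block runs on its own.

```python
from io import BytesIO

def str_extract(file, start, size, delimiter=None, index=None):
    # Same logic as the function above: decode first, then split.
    file.seek(start)
    text = file.read(size).decode('utf-16-le')
    if delimiter is not None and index is not None:
        return text.split(delimiter.decode('utf-16-le'))[index]
    return text

data = (b'H\x00e\x00l\x00l\x00o\x00 \x00W\x00o\x00r\x00l\x00d\x00'
        b'\x00\x00\x00\x00'
        b'1\x002\x003\x00')
fobj = BytesIO(data)

first = str_extract(fobj, 0, 0x20, b'\x00' * 4, 0)
second = str_extract(fobj, 0, 0x20, b'\x00' * 4, 1)
print(first)
print(second)
```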
