Python: Converting a UTF-8 string to a byte string

myt*_*889 6 python string encoding utf-8

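I have the following function that parses a UTF-8 string from a sequence of bytes.

Note: 'length_size' is the number of bytes used to represent the length of the UTF-8 string.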

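import struct

def parse_utf8(self, bytes, length_size):
    # The first length_size bytes hold the byte length of the UTF-8 payload.
    length = bytes2int(bytes[0:length_size])
    # Return the payload as a byte string (it is not decoded to unicode here).
    value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
    return value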


def bytes2int(raw_bytes, signed=False):
    """
    Convert a string of bytes to an integer (assumes little-endian byte order)
    """
    if len(raw_bytes) == 0:
        return None
    fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
    if signed:
        fmt = fmt.lower()
    return struct.unpack('<'+fmt, raw_bytes)[0]

I would like to write the reverse of this function: one that takes a UTF-8 encoded string and returns its representation as a byte string.

So far, I have the following:

def create_utf8(self, utf8_string):
    return utf8_string.encode('utf-8')

I get the following error when I try to test it:

  File "writer.py", line 229, in create_utf8
    return utf8_string.encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)

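If possible, I would like to adopt a code structure similar to the parse_utf8 example. What am I doing wrong?

Thank you for your help!

Update: test driver, now correct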

def random_utf8_seq(self, length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"

    utf8_seq = u""

    for i in range(length):
        utf8_seq += random.choice(test_charset)

    return utf8_seq

I get the following error:

    input_str = self.random_utf8_seq(200)
  File "writer.py", line 226, in random_utf8_seq
    print unicode(utf8_seq, "utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
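Both tracebacks above involve decoding a byte string with a codec that cannot interpret those bytes. In Python 2, calling .encode() on a byte str first decodes it implicitly with the ASCII codec, and that hidden step is what raises the UnicodeDecodeError. The following is a minimal sketch that performs the decode explicitly; try_decode is a hypothetical helper used only for illustration, and the byte values are taken from the two error messages.

```python
def try_decode(data, codec):
    """Return the decoded text, or a short error string if decoding fails."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError as exc:
        return "error: %s" % exc.reason

# Byte 0x98 is outside the ASCII range, so the implicit ASCII decode fails.
ascii_result = try_decode(b"\x98", "ascii")

# Byte 0xbb alone is not a valid UTF-8 start byte (it is a Latin-1 character).
utf8_result = try_decode(b"\xbb", "utf-8")

# The same character survives when it starts out as unicode text.
round_trip = try_decode(u"\u00bb".encode("utf-8"), "utf-8")

print(ascii_result)
print(utf8_result)
print(round_trip == u"\u00bb")
```

The last case suggests the usual remedy: keep text as unicode objects internally and call .encode('utf-8') only at the point where bytes are needed.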

Dav*_*ric 4

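If a utf-8 => bytestring conversion is what you want, then you can use str.encode, but first you need to mark the type of the source string in your example correctly: prefix it with u for unicode: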

# coding: utf-8
import random

def random_utf8_seq(length):
    # from http://www.w3.org/2001/06/utf-8-test/postscript-utf-8.html
    test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"

    utf8_seq = u''

    for i in range(length):
        utf8_seq += random.choice(test_charset)

    print utf8_seq.encode('utf-8')
    return utf8_seq.encode('utf-8')

print( type(random_utf8_seq(200)) )

-- Output --

õ3×sÔP{Ć.s(Ë°˙ě÷xÓ@bűV—û´ő¢uZÓČn˜0|_"Ðyø`êš·ÏÝhunÍÅ=ä?
óP{tlÇűpb¸7s´ňƒG—čøň\zčłŢXÂYqLĆúěă(ÿî ¥PyÐÔŇn×œ¦Ì˝+•ì›
ŻÛ°Ñ^ÝC÷ŢŐIñJĹţÒył­"MťÆ‹ČČ4þ!»šåŮ@Öhň-
ÈLGĄ¢ß˛Đ¯.ªÆź˘Ř^ĽÛŹËaĂŕ¹#¢éüÜńlÊqš=VřU…‚–MŽÎÉèoÙŹŠ¨Ð
<type 'str'>
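To mirror the structure of parse_utf8 from the question, the encoded bytes can be prefixed with their length. The following is only a sketch under assumptions: the length_size parameter on create_utf8 and the _LENGTH_FORMATS table are illustrative additions, and this parse_utf8 decodes the payload back to unicode, which the original version does not do.

```python
import struct

# Hypothetical helper table: prefix width in bytes -> unsigned struct code.
_LENGTH_FORMATS = {1: "B", 2: "H", 4: "I", 8: "Q"}

def create_utf8(text, length_size):
    """Encode unicode text as UTF-8 and prepend its byte length (little-endian)."""
    payload = text.encode("utf-8")
    prefix = struct.pack("<" + _LENGTH_FORMATS[length_size], len(payload))
    return prefix + payload

def parse_utf8(data, length_size):
    """Inverse of create_utf8: read the length prefix, then decode the payload."""
    fmt = "<" + _LENGTH_FORMATS[length_size]
    (length,) = struct.unpack(fmt, data[:length_size])
    return data[length_size:length_size + length].decode("utf-8")

sample = u"\u00e9\u20ac"  # e-acute (2 bytes in UTF-8) and euro sign (3 bytes)
packed = create_utf8(sample, 2)
print(len(packed))  # 2-byte length prefix plus 5 payload bytes
print(parse_utf8(packed, 2) == sample)
```

Note that the prefix stores the number of encoded bytes, not the number of characters, since the two differ for any non-ASCII text.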