How to remove the last UTF-8 character from a Python string

Question

I have a string containing UTF-8 encoded text. I need to remove the last UTF-8 character.

So far I have this:

msg = msg[:-1]


But this only removes the last byte. It works as long as the last character is ASCII. When the last character is a multi-byte character, it no longer works.

Answer 1

Mar*_*ers 5

The simplest option is to decode your UTF-8 bytes to Unicode text:

without_last = msg.decode('utf8')[:-1]


You can always re-encode the result afterwards.
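
As a rough illustration of the decode, slice, re-encode round trip, here is a minimal sketch written for Python 3, where byte strings are `bytes` objects. The helper name `drop_last_char` is hypothetical and not part of the original answer, and the input is assumed to be valid UTF-8.

```python
def drop_last_char(data: bytes) -> bytes:
    """Remove the last code point from UTF-8 encoded bytes (assumes valid UTF-8)."""
    text = data.decode('utf-8')        # bytes -> Unicode text
    return text[:-1].encode('utf-8')   # drop one code point, then re-encode


print(drop_last_char('caf\u00e9'.encode('utf-8')))  # b'caf'
```

Note that this removes one code point, which may not match what a reader sees as one character when combining marks are involved.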

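Another option is to search for the UTF-8 starting byte. A UTF-8 byte sequence always starts with a byte whose most significant bit is set to 0, or whose two most significant bits are set to 1, while continuation bytes always start with the bits 10: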

# find starting byte of last codepoint
pos = len(msg) - 1
while pos > -1 and ord(msg[pos]) & 0xC0 == 0x80:
    # character at pos is a continuation byte (bit 7 set, bit 6 not)
    pos -= 1
msg = msg[:pos]
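
The snippet above relies on Python 2 byte strings, where indexing returns a one-character string and `ord()` is needed. A possible Python 3 equivalent might look like the sketch below; the function name `drop_last_char_bytes` is hypothetical, and treating input that consists only of continuation bytes as a single removable unit is an assumption of this sketch, not something the original answer specifies.

```python
def drop_last_char_bytes(data: bytes) -> bytes:
    """Strip the last UTF-8 code point by scanning back to its starting byte."""
    pos = len(data) - 1
    # In Python 3, indexing bytes yields an int, so ord() is not needed.
    # Continuation bytes have the bit pattern 10xxxxxx (bit 7 set, bit 6 clear).
    while pos > 0 and data[pos] & 0xC0 == 0x80:
        pos -= 1
    # max() guards the empty input, where pos starts at -1.
    return data[:max(pos, 0)]


print(drop_last_char_bytes('a\u20ac'.encode('utf-8')))  # b'a'
```

This avoids decoding the whole string, which can matter for large buffers, but it does not validate that the bytes are well-formed UTF-8.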

