Python: Finding the equivalent surrogate pair from a non-BMP Unicode character

Question

hil*_*ssu 8 python unicode encoding surrogate-pairs emoji

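The answer given here: "How to work with surrogate pairs in Python?" tells you how to convert a surrogate pair such as '\ud83d\ude4f' into a single non-BMP Unicode character (the answer being "\ud83d\ude4f".encode('utf-16', 'surrogatepass').decode('utf-16')). I would like to know how to do this in reverse. How can I, using Python, find the equivalent surrogate pair for a non-BMP character, converting '\U0001f64f' (🙏) back to '\ud83d\ude4f'? I couldn't find a clear answer.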
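For reference, the forward direction mentioned in the question can be sketched as follows. This is a minimal illustration, assuming Python 3; the variable names `pair` and `combined` are only for this example.

```python
# Two lone surrogate code units stored as separate characters in a str.
pair = '\ud83d\ude4f'

# 'surrogatepass' allows the encoder to emit lone surrogates as UTF-16
# code units; decoding those bytes merges them into one code point.
combined = pair.encode('utf-16', 'surrogatepass').decode('utf-16')

print(len(pair), len(combined))  # the pair has two characters, the result one
print(hex(ord(combined)))        # code point of the combined character
```

The question asks for the inverse: starting from `combined` and getting back to `pair`.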

Answer 1

Mar*_*ers 5

You have to manually replace each non-BMP code point with its surrogate pair. You can do this with a regular expression:

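import re

# Matches any single code point outside the Basic Multilingual Plane.
_nonbmp = re.compile(r'[\U00010000-\U0010FFFF]')

def _surrogatepair(match):
    char = match.group()
    assert ord(char) > 0xffff
    # A non-BMP character encodes to 4 bytes in UTF-16: two 16-bit code units.
    encoded = char.encode('utf-16-le')
    # Turn each code unit into its own (lone surrogate) character.
    return (
        chr(int.from_bytes(encoded[:2], 'little')) +
        chr(int.from_bytes(encoded[2:], 'little')))

def with_surrogates(text):
    # Replace every non-BMP character in text with its surrogate pair.
    return _nonbmp.sub(_surrogatepair, text)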

Demo:

>>> with_surrogates('\U0001f64f')
'\ud83d\ude4f'
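As an alternative to going through the UTF-16 encoder, the same pair can be computed directly with the standard UTF-16 arithmetic. The following is a sketch, not part of the original answer; the helper name `surrogate_pair` is hypothetical.

```python
def surrogate_pair(char):
    """Return the UTF-16 surrogate pair for one non-BMP character."""
    cp = ord(char)
    if cp < 0x10000:
        raise ValueError('character is inside the BMP; no surrogate pair exists')
    offset = cp - 0x10000            # 20-bit value spread over two code units
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return chr(high) + chr(low)

pair = surrogate_pair('\U0001f64f')
# ascii() is used because printing lone surrogates directly can raise
# UnicodeEncodeError on a UTF-8 terminal.
print(ascii(pair))

# Round trip back to the single character, using the approach from the question.
restored = pair.encode('utf-16', 'surrogatepass').decode('utf-16')
print(restored == '\U0001f64f')
```

Either approach yields a string containing lone surrogates, which cannot be encoded to UTF-8 without an error handler such as 'surrogatepass'.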

Archived:	9 years, 9 months ago
Views:	1920
Last recorded:	7 years, 10 months ago