Some annoying characters are not normalized by unicodedata

Ruc*_*hit 1 python unicode unicode-normalization python-3.x python-unicode

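I have a Python string that looks like the one below. The string comes from an SEC filing by a US public company. I am trying to strip some annoying characters from it with the unicodedata.normalize function, but that does not remove all of them. What could be the reason for this behavior?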

from unicodedata import normalize
s = 'GTS.Client.Services@JPMChase.com\nFacsimile\nNo.:\xa0 312-233-2266\n\xa0\nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:\xa0 Hiral Patel\nFacsimile No.:\xa0 312-385-7096\n\xa0\nLadies and Gentlemen:\n\xa0\nReference is made to the\nCredit Agreement, dated as of May\xa07, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries,\xa0Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'

normalize('NFKC', s)
'GTS.Client.Services@JPMChase.com\nFacsimile\nNo.:  312-233-2266\n \nJPMorgan Chase Bank,\nN.A., as Administrative Agent\n10 South Dearborn, Floor 7th\nIL1-0010\nChicago, IL 60603-2003\nAttention:  Hiral Patel\nFacsimile No.:  312-385-7096\n \nLadies and Gentlemen:\n \nReference is made to the\nCredit Agreement, dated as of May 7, 2010 (as the same may be amended,\nrestated, supplemented or otherwise modified from time to time, the \x93Credit Agreement\x94), by and among\nHawaiian Electric Industries, Inc., a Hawaii corporation (the \x93Borrower\x94), the Lenders from time to\ntime party thereto and JPMorgan Chase Bank, N.A., as issuing bank and\nadministrative agent (the \x93Administrative Agent\x94).'


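As the output shows, the character \xa0 is handled properly, but characters such as \x92, \x93 and \x94 are not normalized and appear unchanged in the resulting string.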

Mar*_*nen 5

Your data was decoded as ISO-8859-1 (aka latin1), but in that encoding these Unicode code points are control characters. In Windows-1252 (aka cp1252) they are the so-called smart quotes:

>>> '\x92\x93\x94'.encode('latin1').decode('cp1252')
'’“”'
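The same round trip can be wrapped in a small helper. This is only a sketch: the name fix_c1_controls is hypothetical, and it assumes the text really was cp1252 that got decoded as latin1. It touches only the C1 control range (U+0080 to U+009F), so characters above U+00FF, which cannot be encoded as latin1, pass through untouched, as do the few bytes that Python's cp1252 codec leaves undefined (such as 0x81).

```python
def fix_c1_controls(text: str) -> str:
    """Reinterpret C1 control characters (U+0080..U+009F) as cp1252 characters.

    Assumption: the text was cp1252 bytes wrongly decoded as latin1.
    """
    out = []
    for ch in text:
        if '\x80' <= ch <= '\x9f':
            try:
                ch = ch.encode('latin1').decode('cp1252')
            except UnicodeDecodeError:
                pass  # byte is undefined in cp1252 (e.g. 0x81); keep it unchanged
        out.append(ch)
    return ''.join(out)


print(fix_c1_controls('the \x93Borrower\x94'))
```

Working per character is slower than a single encode/decode call on the whole string, but it does not raise on input that mixes mis-decoded bytes with genuine non-Latin-1 text.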

Normalization does not change them either, but at least they display correctly once decoded properly:

>>> import unicodedata as ud
>>> ud.normalize('NFKC', '\x92\x93\x94'.encode('latin1').decode('cp1252'))
'’“”'
>>> print(s.encode('latin1').decode('cp1252'))
GTS.Client.Services@JPMChase.com
Facsimile
No.:  312-233-2266
 
JPMorgan Chase Bank,
N.A., as Administrative Agent
10 South Dearborn, Floor 7th
IL1-0010
Chicago, IL 60603-2003
Attention:  Hiral Patel
Facsimile No.:  312-385-7096
 
Ladies and Gentlemen:
 
Reference is made to the
Credit Agreement, dated as of May 7, 2010 (as the same may be amended,
restated, supplemented or otherwise modified from time to time, the “Credit Agreement”), by and among
Hawaiian Electric Industries, Inc., a Hawaii corporation (the “Borrower”), the Lenders from time to
time party thereto and JPMorgan Chase Bank, N.A., as issuing bank and
administrative agent (the “Administrative Agent”).
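One way to see why NFKC leaves both the control characters and the smart quotes alone is to look at each character's general category and compatibility decomposition. The sketch below uses only the standard unicodedata module; the describe helper is a hypothetical name introduced for illustration.

```python
import unicodedata as ud


def describe(ch: str) -> tuple:
    """Return (code point, name, general category, decomposition) for one character."""
    return (
        f'U+{ord(ch):04X}',
        ud.name(ch, '<unnamed>'),  # control characters have no name, so use a default
        ud.category(ch),
        ud.decomposition(ch),      # empty string means there is nothing to normalize to
    )


for ch in '\x93\u201c\xa0':
    print(describe(ch))
# ('U+0093', '<unnamed>', 'Cc', '')
# ('U+201C', 'LEFT DOUBLE QUOTATION MARK', 'Pi', '')
# ('U+00A0', 'NO-BREAK SPACE', 'Zs', '<noBreak> 0020')
```

Only the no-break space carries a compatibility decomposition, which is why it is the only one of the three that NFKC rewrites.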

Note that the \xa0 code point is U+00A0 (NO-BREAK SPACE), and NFKC normalizes it to a plain SPACE:

>>> ud.name('\xa0')
'NO-BREAK SPACE'
>>> ud.normalize('NFKC', '\xa0')
' '
>>> ud.name(ud.normalize('NFKC', '\xa0'))
'SPACE'

It also prints correctly without normalization, as in the print() output above.
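If the goal is plain ASCII punctuation, which the question hints at but does not state, the two steps can be combined with an explicit quote mapping. This is a sketch under assumptions: the clean function and the ASCII_QUOTES table are hypothetical names, the table covers only the four curly quotes, and the whole-string round trip raises if the input contains characters above U+00FF or bytes that cp1252 leaves undefined.

```python
from unicodedata import normalize

# Hypothetical mapping for illustration; extend it for other punctuation as needed.
ASCII_QUOTES = str.maketrans({
    '\u2018': "'", '\u2019': "'",   # single curly quotes
    '\u201c': '"', '\u201d': '"',   # double curly quotes
})


def clean(text: str) -> str:
    # Undo the wrong latin1 decoding (assumes the source bytes were cp1252).
    repaired = text.encode('latin1').decode('cp1252')
    # NFKC turns NO-BREAK SPACE into SPACE; translate() then replaces curly quotes.
    return normalize('NFKC', repaired).translate(ASCII_QUOTES)


print(clean('the \x93Credit Agreement\x94, dated as of May\xa07, 2010'))
# the "Credit Agreement", dated as of May 7, 2010
```

The translate step is needed because the curly quotes have no compatibility decomposition, so no normalization form maps them to ASCII.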