How to remove Unicode "punctuation" from a Python string

Question

Here's the problem: I have a Unicode string as input to a Python SQLite query. The query (a 'like') failed. It turns out the string 'FRANCE' doesn't have six characters, it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?

Answer 1

小智 11

You can use the unicodedata categories, which are part of the Unicode data tables in Python:

>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'

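As you can see, the category of punctuation characters starts with "P". So you need to filter the string character by character (using a list comprehension).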
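
A minimal sketch of that character-by-character filter, assuming "punctuation" means any character whose category starts with "P". The function name strip_punctuation is hypothetical, chosen only for illustration:

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    # The list comprehension keeps only the non-punctuation characters.
    return "".join(
        [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    )


print(strip_punctuation(u"Hello, world. «Bonjour» ¿ok?"))
```

Note that this filter alone would not remove U+FEFF, because that character is not in a "P" category.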

In your case:

>>> unicodedata.category(u'\ufeff')
'Cf'

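U+FEFF is therefore a format character (category "Cf"), not punctuation, so you can whitelist characters based on their category.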
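
One possible way to sketch such a whitelist, assuming that letters, numbers, and separators (major classes "L", "N", and "Z") are the only characters you want to keep. Both the set ALLOWED_MAJOR_CLASSES and the function keep_whitelisted are hypothetical names, and the choice of classes is an assumption you would adjust to your data:

```python
import unicodedata

# Hypothetical whitelist of major category classes:
# L = letters, N = numbers, Z = separators (including the plain space).
ALLOWED_MAJOR_CLASSES = {"L", "N", "Z"}


def keep_whitelisted(text, allowed=ALLOWED_MAJOR_CLASSES):
    """Keep only characters whose major category class is whitelisted."""
    # category(ch) returns a two-letter code such as 'Lu' or 'Cf';
    # its first letter is the major class.
    return "".join(
        [ch for ch in text if unicodedata.category(ch)[0] in allowed]
    )


cleaned = keep_whitelisted(u"FRANCE\ufeff")
print(len(cleaned))  # 6, because U+FEFF (category 'Cf') is dropped
```

With this particular whitelist, punctuation and control characters such as newlines (category "Cc") are dropped as well, so widen the set if you need to keep them.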

Archived:	14 years, 11 months ago
Views:	5841
Last activity:	11 years, 3 months ago