Python: remove punctuation from a unicode string, except apostrophes

Question


Kam*_*ing 9 python regex unicode punctuation

I found several threads on this topic, and came across this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)


This should remove every punctuation mark except ', but the problem is that it also removes everything else in the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
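A possible explanation for the output above, offered as a sketch rather than a definitive diagnosis: the standard-library `re` module has no `\P{P}` property class. Python 2 treated the unknown escape `\P` as a literal `P`, so the character class excluded only the characters `P`, `{`, `}` and `'`, and everything else was deleted. The snippet below is written for Python 3 (an assumption, since the question uses Python 2) and reproduces that behaviour with an explicit literal class; the variable names are illustrative only.

```python
import re

pattern = r"[^\P{P}']+"

# Recent Python 3 versions reject the unknown escape \P instead of
# silently treating it as a literal "P".
try:
    re.compile(pattern)
    supported = True
except re.error:
    supported = False

# What Python 2's re effectively matched: any run of characters
# other than the literals P, {, } and '.
literal_equivalent = r"[^P{}']+"

sentence = "warhol's art, film, and music."
stripped = re.sub(literal_equivalent, "", sentence)
print(repr(stripped))
```

The third-party `regex` module does understand `\p{P}` and `\P{P}`, so the original pattern may work there unchanged; that is not verified here.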


Of course, what I want is the sentence with its punctuation removed, but with "warhol's" left as it is.

Desired output:

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"


Edit: I also tried using

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P')) 
sentence = sentence.translate(tbl)


But this removes every punctuation mark, including the apostrophe.
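The translate-based attempt can probably be adapted by leaving the wanted characters out of the table. The following is a minimal sketch for Python 3 (an assumption; the question targets Python 2, where `unichr` and `xrange` would be used instead). `KEEP`, `PUNCT_TABLE` and `strip_punctuation` are hypothetical names introduced for illustration.

```python
import sys
import unicodedata

# Characters to preserve even though Unicode classifies them as punctuation.
KEEP = {"'", "-"}

# Map every punctuation code point (general category P*) to None,
# except the ones listed in KEEP. Built once, reused for every call.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P") and chr(i) not in KEEP
)


def strip_punctuation(text):
    """Delete Unicode punctuation from text, keeping apostrophes and hyphens."""
    return text.translate(PUNCT_TABLE)


print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire"))
```

Note that the typographic apostrophe (U+2019) is a different code point and would still be removed unless it is added to `KEEP`.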

Answer 1

C.B*_*.B. 9

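Specify the elements you do not want removed, i.e. \w, \d, \s and so on. That is what the ^ operator means inside square brackets: match anything except the listed characters.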

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>
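The same idea, sketched for Python 3 and extended to keep the hyphen so that "austro-hungarian empire" from the question survives. This is an illustration under assumptions, not the answerer's code: `strip_punctuation` and `_NOT_KEPT` are hypothetical names, and `\d` is dropped because `\w` already covers digits.

```python
import re

# Negated class: match runs of anything that is NOT a word character,
# whitespace, an apostrophe, or a hyphen. The trailing "-" is literal.
# In Python 3, \w and \s are Unicode-aware for str patterns by default.
_NOT_KEPT = re.compile(r"[^\w\s'-]+")


def strip_punctuation(text):
    """Remove everything except word characters, whitespace, ' and -."""
    return _NOT_KEPT.sub("", text)


print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire!"))
```

Because the class is defined by what is kept, this also strips symbols such as `$` or `+`, which are not punctuation in the Unicode sense, and it keeps the underscore, which `\w` includes.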

