Remove punctuation without removing emoji

jac*_*ell 5 python

How can I remove only the punctuation and keep the emoji? I think there may be a way to do this with a regular expression, but I'm not sure.

sentence = ['hello', 'world', '!', '']

def remove_punct(token):
    return [word for word in token if word.isalpha()]

print(remove_punct(sentence))
#output
#['hello', 'world']
#desired output
#['hello', 'world', '']

Dan*_*ejo 8

One approach:

from string import punctuation

sentence = ["hello", "world", "!", ""]

punct_set = set(punctuation)


def remove_punct(token):
    return [word for word in token if word not in punct_set]


print(remove_punct(sentence))

Output

['hello', 'world', '']

The punctuation variable contains:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
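Note that string.punctuation only covers ASCII characters. As a rough sketch, assuming you also want to drop non-ASCII punctuation such as "。" or "…", you could test each character's Unicode category with unicodedata.category instead. The helper name is_punct is made up for this illustration. Emoji fall in the symbol category "So", so they are kept:

```python
import unicodedata


def is_punct(word):
    # Hypothetical helper: True when every character is in a Unicode
    # punctuation category (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    # Empty tokens are not treated as punctuation.
    return bool(word) and all(
        unicodedata.category(ch).startswith("P") for ch in word
    )


def remove_punct(tokens):
    return [word for word in tokens if not is_punct(word)]


# "\u3002" is the ideographic full stop, "\u2026" the horizontal ellipsis,
# "\U0001F600" a grinning-face emoji.
sentence = ["hello", "world", "!", "\u3002", "\u2026", "\U0001F600"]
result = remove_punct(sentence)
print(result)
```

One difference from string.punctuation: characters such as $ and + are classified as symbols ("S" categories) in Unicode, so this sketch keeps them.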

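If some words consist of several punctuation characters, you can use set.isdisjoint to filter out every word that contains at least one punctuation character: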

# notice the ...
sentence = ["hello", "world", "!", "", "..."]

def remove_punct(token):
    return [word for word in token if punct_set.isdisjoint(word)]

print(remove_punct(sentence))

Output (using set.isdisjoint)

['hello', 'world', '']

Finally, if you want to keep words that contain at least one non-punctuation character, use set.issuperset as follows:

# notice the ... and Mr.
sentence = ["hello", "world", "mr.", "!", "", "..."]

def remove_punct(token):
    return [word for word in token if not punct_set.issuperset(word)]

print(remove_punct(sentence))

Output (using set.issuperset)

['hello', 'world', 'mr.', '']  # mr. is kept because it contains mr
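Since the question asked about regular expressions, here is one possible regex equivalent of the set.issuperset filter. It is only a sketch: the name PUNCT_ONLY is invented for this example, and the pattern is simply a character class built from string.punctuation, so it covers ASCII punctuation only.

```python
import re
from string import punctuation

# Character class built from string.punctuation; re.escape protects
# characters such as ], ^, - and \ that are special inside [...].
PUNCT_ONLY = re.compile(f"[{re.escape(punctuation)}]+")


def remove_punct(tokens):
    # Drop tokens made up entirely of ASCII punctuation.
    return [tok for tok in tokens if not PUNCT_ONLY.fullmatch(tok)]


# "\U0001F600" is a grinning-face emoji.
sentence = ["hello", "world", "mr.", "!", "\U0001F600", "..."]
result = remove_punct(sentence)
print(result)
```

Using fullmatch means a token is removed only when every character is punctuation, so "mr." and the emoji are kept while "!" and "..." are dropped.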

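  • I know, but punctuation is a string while punct_set is a set. Membership tests on a set are generally faster, though I agree the string is short enough that there may be no advantage. It would need to be benchmarked (2 upvotes)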
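A quick way to check the point raised in the comment is to time both variants with timeit. This is a minimal sketch: the function names, the token list, and the repeat count are arbitrary choices for illustration, and no particular result is claimed here because timings depend on the machine and Python version.

```python
import timeit
from string import punctuation

punct_set = set(punctuation)

# Arbitrary test data: a short token list repeated to get measurable times.
tokens = ["hello", "world", "!", "..."] * 1000


def filter_with_str(tokens):
    # Membership test against the punctuation string (a substring search).
    return [w for w in tokens if w not in punctuation]


def filter_with_set(tokens):
    # Membership test against the set (a hash lookup).
    return [w for w in tokens if w not in punct_set]


t_str = timeit.timeit(lambda: filter_with_str(tokens), number=100)
t_set = timeit.timeit(lambda: filter_with_set(tokens), number=100)
print(f"str: {t_str:.4f}s  set: {t_set:.4f}s")
```

Apart from speed, the two tests are not strictly equivalent: `in` on a string is a substring check, so a token like "()" counts as contained in punctuation but is not an element of punct_set.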