Remove punctuation without removing emoji

Question

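How can I remove only the punctuation without removing emoji? I suspect there may be a way to do this with a regular expression, but I'm not sure.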

sentence = ['hello', 'world', '!', '']

def remove_punct(token):
    return [word for word in token if word.isalpha()]

print(remove_punct(sentence))
#output
#['hello', 'world']
#desired output
#['hello', 'world', '']

Answer 1

Dan*_*ejo 8

One approach:

from string import punctuation

sentence = ["hello", "world", "!", ""]

punct_set = set(punctuation)


def remove_punct(token):
    return [word for word in token if word not in punct_set]


print(remove_punct(sentence))

Output

['hello', 'world', '']
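
Since the question asks about regular expressions, here is a minimal sketch of a regex variant. It assumes you want to drop a token only when it consists entirely of ASCII punctuation; the name remove_punct_regex and the sample tokens are made up for illustration.

```python
import re
from string import punctuation

# Character class built from string.punctuation; re.escape protects
# characters such as ], \, ^ and - that are special inside a class.
PUNCT_ONLY = re.compile(f"[{re.escape(punctuation)}]+")


def remove_punct_regex(tokens):
    # Drop a token only when every character in it is ASCII punctuation.
    return [tok for tok in tokens if not PUNCT_ONLY.fullmatch(tok)]


# "\U0001F600" is a grinning-face emoji, chosen purely for illustration.
sample = ["hello", "world", "!", "...", "mr.", "\U0001F600"]
print(remove_punct_regex(sample))
```

With this pattern "!" and "..." are dropped, while "mr." and the emoji are kept because they contain at least one character outside the class.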

The variable punctuation contains:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
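
Note that string.punctuation covers ASCII only, so characters such as curly quotes or the ellipsis character are not in it. If your tokens may contain those, one possible sketch (not part of the original answer) is to test the Unicode general category with the standard unicodedata module. The helper names below are hypothetical. Be aware that Unicode classifies characters like $ and + as symbols, not punctuation, so this behaves differently from string.punctuation for them.

```python
import unicodedata


def is_unicode_punct(token):
    # True when the token is non-empty and every character has a Unicode
    # general category starting with "P" (Pc, Pd, Ps, Pe, Pi, Pf, Po).
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def remove_unicode_punct(tokens):
    return [tok for tok in tokens if not is_unicode_punct(tok)]


# "\u2026" is the horizontal ellipsis, "\u201c" a left curly quote and
# "\U0001F600" a grinning-face emoji (category So, so it is kept).
tokens = ["hello", "\u2026", "\u201c", "!", "\U0001F600", "$"]
print(remove_unicode_punct(tokens))
```

Here the ellipsis, the curly quote and "!" are removed; the emoji and "$" survive because their categories start with "S".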

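If some tokens consist of several punctuation characters, you can use set.isdisjoint to filter out every token that contains at least one punctuation character: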

# notice the ...
sentence = ["hello", "world", "!", "", "..."]

def remove_punct(token):
    return [word for word in token if punct_set.isdisjoint(word)]

print(remove_punct(sentence))

Output (using set.isdisjoint)

['hello', 'world', '']

Finally, if you want to keep tokens that contain at least one non-punctuation character, use set.issuperset as follows:

# notice the ... and mr.
sentence = ["hello", "world", "mr.", "!", "", "..."]

def remove_punct(token):
    return [word for word in token if not punct_set.issuperset(word)]

print(remove_punct(sentence))

Output (using set.issuperset)

['hello', 'world', 'mr.', '']  # mr. is kept because it contains mr
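
To make the difference between the three predicates easier to see, here is a small side-by-side sketch on one hypothetical token list. One edge case worth knowing: every set is a superset of an empty iterable, so the issuperset variant also drops empty strings, while the other two keep them.

```python
from string import punctuation

punct_set = set(punctuation)

# Hypothetical token list; "\U0001F600" stands in for any emoji.
tokens = ["hello", "mr.", "!", "...", "", "\U0001F600"]

exact = [t for t in tokens if t not in punct_set]
disjoint = [t for t in tokens if punct_set.isdisjoint(t)]
superset = [t for t in tokens if not punct_set.issuperset(t)]

print(exact)     # drops only "!"; keeps "mr.", "...", "" and the emoji
print(disjoint)  # drops "mr.", "!" and "..."; keeps "" and the emoji
print(superset)  # drops "!", "..." and ""; keeps "mr." and the emoji
```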

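I know, but punctuation is a string while punct_set is a set. Set lookups are generally faster, though I agree the string is short enough that there may be no real advantage. It would need to be benchmarked. (2 upvotes)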
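
A rough way to run that benchmark is sketched below with the standard timeit module. The sample text and the iteration count are arbitrary assumptions, and the timings will vary by machine, so no result is claimed here. Keep in mind that `in` on a str is a substring test, so the two checks only agree for single-character tokens: for example, "" and "()" are both "in" the string but not in the set.

```python
import timeit
from string import punctuation

punct_set = set(punctuation)
# Arbitrary sample, split into single-character tokens.
chars = list("hello, world! (mr.) ...")


def via_string():
    # Substring test against the 32-character punctuation string.
    return [c for c in chars if c not in punctuation]


def via_set():
    # Hash lookup in a set of 32 single characters.
    return [c for c in chars if c not in punct_set]


# Both give the same answer for single-character tokens.
assert via_string() == via_set()

for fn in (via_string, via_set):
    seconds = timeit.timeit(fn, number=20_000)
    print(f"{fn.__name__}: {seconds:.3f}s")
```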
