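How can I remove only punctuation without also removing emoji? I suspect there may be a way to do this with a regular expression, but I'm not sure.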
sentence = ['hello', 'world', '!', '']
def remove_punct(token):
    return [word for word in token if word.isalpha()]
print(remove_punct(sentence))
#output
#['hello', 'world']
#desired output
#['hello', 'world', '']
One approach:
from string import punctuation
sentence = ["hello", "world", "!", ""]
punct_set = set(punctuation)
def remove_punct(token):
    return [word for word in token if word not in punct_set]
print(remove_punct(sentence))
Output
['hello', 'world', '']
The punctuation variable contains:
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
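Note that string.punctuation covers ASCII characters only, so tokens such as "。" or "…" would pass through the filter. A minimal sketch of one way to widen the check, assuming it is acceptable to treat any character whose Unicode general category starts with "P" as punctuation (the helper name is_punct_char and the sample emoji are illustrative, not from the original post). Emoji typically fall in category "So" (Symbol, other), so they are not matched:

```python
import unicodedata
from string import punctuation

ascii_punct = set(punctuation)

def is_punct_char(ch):
    # Hypothetical helper: ASCII punctuation from string.punctuation,
    # or any character in a Unicode "P*" (punctuation) category.
    return ch in ascii_punct or unicodedata.category(ch).startswith("P")

def remove_punct(tokens):
    # Drop tokens that are non-empty and made up only of punctuation characters.
    return [t for t in tokens if not (t and all(is_punct_char(c) for c in t))]

# "\N{GRINNING FACE}" is a stand-in emoji chosen for this sketch.
sentence = ["hello", "world", "!", "\N{GRINNING FACE}", "。", "…", "«"]
print(remove_punct(sentence))
```

The ASCII set is still consulted because several characters in string.punctuation, such as "$" and "+", are classified by Unicode as symbols rather than punctuation.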
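If some tokens consist of several punctuation characters, you can use set.isdisjoint to filter out every token that contains at least one punctuation character: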
# notice the ...
sentence = ["hello", "world", "!", "", "..."]
def remove_punct(token):
    return [word for word in token if punct_set.isdisjoint(word)]
print(remove_punct(sentence))
Output (using set.isdisjoint)
['hello', 'world', '']
Finally, if you want to keep tokens that contain at least one non-punctuation character, use set.issuperset as follows:
# notice the ... and Mr.
sentence = ["hello", "world", "mr.", "!", "", "..."]
def remove_punct(token):
    return [word for word in token if not punct_set.issuperset(word)]
print(remove_punct(sentence))
Output (using set.issuperset)
['hello', 'world', 'mr.', '']  # mr. is kept because it contains mr
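Since the question asks about regular expressions, the same "drop tokens made entirely of punctuation" rule could also be sketched with the re module. This is an alternative to the set-based answer, not part of it; the pattern name PUNCT_ONLY and the sample emoji are assumptions made for illustration:

```python
import re
from string import punctuation

# Character class built from string.punctuation; re.escape protects
# characters such as ], \, ^ and - that are special inside a class.
PUNCT_ONLY = re.compile(f"[{re.escape(punctuation)}]+")

def remove_punct(tokens):
    # fullmatch succeeds only when the whole token is ASCII punctuation,
    # so "mr." and emoji are kept while "!" and "..." are dropped.
    return [t for t in tokens if not PUNCT_ONLY.fullmatch(t)]

# "\N{GRINNING FACE}" is a stand-in emoji chosen for this sketch.
sentence = ["hello", "world", "mr.", "!", "\N{GRINNING FACE}", "..."]
print(remove_punct(sentence))
```

Because the pattern requires at least one character, an empty string is never matched and is therefore kept.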