I have a list of stopwords and a search string. I want to remove the stopwords from the string.
For example:
stopwords=['what','who','is','a','at','is','he']
query='What is hello'
The code should now remove 'What' and 'is'. In my case, however, it also strips out 'a' and 'at'. My code is below. What could I be doing wrong?
for word in stopwords:
    if word in query:
        print(word)
        query = query.replace(word, "")
If the input query is "What is Hello", I get the output as:
wht s llo
Why is this happening?
Rob*_*sen 40
This is one way to do it:
query = 'What is hello'
stopwords = ['what','who','is','a','at','is','he']
querywords = query.split()
resultwords = [word for word in querywords if word.lower() not in stopwords]
result = ' '.join(resultwords)
print(result)
I noticed that you also want to remove a word if its lowercase variant is in the list, so I added a call to lower() in the condition check.
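The accepted answer works when the input is a list of words separated by spaces, but that is rarely the case in real life, where punctuation can also separate words. In that case re.split is required.
Testing membership against a set rather than a list also makes lookups faster, although with only a handful of words the cost of hashing each string can offset the gain.
My proposal: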
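import re
query = 'What is hello? Says Who?'
stopwords = {'what','who','is','a','at','is','he'}
# Raw string for the pattern; "if word" drops the empty string that
# re.split returns when the query ends with punctuation.
resultwords = [word for word in re.split(r"\W+", query) if word and word.lower() not in stopwords]
print(resultwords)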
Output (as a list of words):
['hello', 'Says']
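As a hedged variant of the same idea, re.findall(r"\w+", ...) collects the words directly instead of splitting on the separators, which avoids the empty strings that re.split can produce when the query starts or ends with punctuation. The function name remove_stopwords is only an illustrative choice, not part of any library, and the sketch assumes that dropping the punctuation from the result is acceptable.

```python
import re

def remove_stopwords(query, stopwords):
    """Return query without stopwords, treating punctuation as a separator.

    Illustrative helper (the name is an assumption, not a library function).
    """
    stopset = {w.lower() for w in stopwords}   # set membership tests are fast
    words = re.findall(r"\w+", query)          # runs of word characters, never ''
    return " ".join(w for w in words if w.lower() not in stopset)

print(remove_stopwords("What is hello? Says Who?",
                       ["what", "who", "is", "a", "at", "he"]))
# hello Says
```

Note that the punctuation is lost in the result; if it has to be preserved, a different tokenization is needed.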
Building on what karthikr said, try
' '.join(filter(lambda x: x.lower() not in stopwords, query.split()))
Explanation:
query.split()  # splits query on whitespace, i.e. "What is hello" -> ["What", "is", "hello"]
filter(func, iterable)  # takes a function and an iterable (list, string, etc.) and
                        # keeps only the items for which the function, called on one
                        # item at a time, returns a true value
lambda x: x.lower() not in stopwords  # anonymous function that takes a word,
                                      # converts it to lower case, and returns True if
                                      # the word is not in the iterable stopwords
' '.join(iterable)  # joins all items of the iterable (items must be strings)
                    # using the string in front of the dot, i.e. ' ', as the separator,
                    # e.g. ["What", "is", "hello"] -> "What is hello"
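For comparison, here is a small sketch that puts the filter/lambda one-liner next to an equivalent generator expression. The variable names with_filter and with_comprehension are made up for the illustration. In Python 3, filter returns a lazy iterator rather than a list, which str.join consumes without any extra conversion.

```python
stopwords = {"what", "who", "is", "a", "at", "he"}
query = "What is hello"

# filter() keeps the words for which the lambda returns True
with_filter = " ".join(filter(lambda x: x.lower() not in stopwords, query.split()))

# the same selection written as a generator expression
with_comprehension = " ".join(w for w in query.split() if w.lower() not in stopwords)

print(with_filter)  # hello
```

Both forms give the same result; which one reads better is largely a matter of taste.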
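Looking at the other answers to your question, I noticed that they tell you how to do what you are trying to do, but they do not answer the question you asked at the end.
If the input query is "What is Hello", I get the output as:
wht s llo
Why is this happening?
This happens because .replace() replaces every occurrence of the substring you give it, wherever it appears, including inside other words.
For example: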
"My, my! Hello my friendly mystery".replace("my", "")
gives:
>>> "My, ! Hello  friendly stery"
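To see the same effect on the query from the question, the sketch below replays the question's loop and records the string after every stopword that matches. The helper name trace_replace is hypothetical and exists only for this illustration.

```python
def trace_replace(query, stopwords):
    """Replay the question's loop, recording (stopword, query) after each hit.

    Illustrative helper; the name is an assumption, not an existing API.
    """
    steps = []
    for word in stopwords:
        if word in query:                      # substring test, not a whole-word test
            query = query.replace(word, "")    # removes every occurrence
            steps.append((word, query))
    return steps

stopwords = ["what", "who", "is", "a", "at", "is", "he"]
for word, state in trace_replace("What is hello", stopwords):
    print(repr(word), "->", repr(state))
# 'is' -> 'What  hello'
# 'a' -> 'Wht  hello'
# 'he' -> 'Wht  llo'
```

'what' never matches because the test is case-sensitive, while 'a' and 'he' match inside "What" and "hello".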
.replace() essentially splits the string on the substring given as the first argument and joins the pieces back together with the second argument.
"hello".replace("he", "je")
is logically similar to:
"je".join("hello".split("he"))
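A quick way to check this equivalence is to compare both forms on a few inputs. The helper replace_via_split is a made-up name for the illustration, and the comparison assumes a non-empty substring: "hello".split("") raises a ValueError, whereas "hello".replace("", "x") is allowed.

```python
def replace_via_split(text, old, new):
    """Mimic text.replace(old, new) with split/join (non-empty `old` only).

    Hypothetical helper written for this comparison.
    """
    return new.join(text.split(old))

cases = [("hello", "he", "je"), ("My, my! my", "my", ""), ("aaa", "aa", "b")]
for text, old, new in cases:
    # both forms should agree on every case
    assert replace_via_split(text, old, new) == text.replace(old, new)

print(replace_via_split("hello", "he", "je"))  # jello
```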
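If you still want to use .replace to remove whole words, you might think that adding a space before and after the word would be enough, but this misses words at the beginning and end of the string as well as occurrences that are followed or preceded by punctuation.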
"My, my! hello my friendly mystery".replace(" my ", " ")
>>> "My, my! hello friendly mystery"
"My, my! hello my friendly mystery".replace(" my", "")
>>> "My,! hello friendlystery"
"My, my! hello my friendly mystery".replace("my ", "")
>>> "My, my! hello friendly mystery"
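Also, adding spaces before and after will not catch consecutive duplicates: the first match consumes the space that the second occurrence would need, so the second one is skipped and the scan simply continues: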
"hello my my friend".replace(" my ", " ")
>>> "hello my friend"
For these reasons, I suggest you do what you want using the accepted answer by Robby Cornelissen.
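If a replace-style approach is still preferred, one possible alternative (not something the answers above propose) is re.sub with \b word-boundary anchors, which match at the edges of a word without consuming the neighbouring characters. The helper name remove_word is hypothetical, and this minimal sketch assumes that collapsing the leftover whitespace is acceptable.

```python
import re

def remove_word(text, word):
    """Remove whole-word, case-insensitive occurrences of `word`.

    Illustrative helper; the name is an assumption, not an existing API.
    """
    pattern = r"\b" + re.escape(word) + r"\b"  # \b matches only at word edges
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    return " ".join(stripped.split())          # collapse the leftover whitespace

print(remove_word("My, my! hello my my friendly mystery", "my"))
# , ! hello friendly mystery
```

The occurrences at the start of the string, next to punctuation, and in the repeated "my my" are all removed, while "mystery" is left intact. The orphaned punctuation stays behind, which is one more reason the split-and-filter approach is usually simpler.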