Python IF statement with nltk.wordnet.synsets

wat*_*sit 2 python nltk wordnet

import nltk
from nltk.corpus import wordnet as wn

output = []
wordlist = []

entries = nltk.corpus.cmudict.entries()

# Build a list of words without the pronunciations, since pos_tag only works on a list
for entry in entries[:200]:
    wordlist.append(entry[0])

# Keep only the nouns
for word in nltk.pos_tag(wordlist):
    if word[1] == 'NN':
        output.append(word[0])

# Remove every word that has no synsets (this is the part that does not work)
for word in output:
    x = wn.synsets(word)
    if len(x) < 1:
        output.remove(word)

for word in output[:200]:
    print(word, " ", len(wn.synsets(word)))

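I am trying to remove all words that have no synsets, but for some reason it isn't working. When I run the program, I find that even when a word has len(wn.synsets(word)) == 0, it is not removed from my list. Can anyone tell me what is going wrong?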

unu*_*tbu 5

You cannot iterate over a list and remove the current item at the same time. Here is a toy example that demonstrates the problem:

In [73]: output = list(range(10))

In [74]: for item in output:
   ....:     output.remove(item)

You might expect every item in output to be removed. Instead, half of them are still there:

In [75]: output
Out[75]: [1, 3, 5, 7, 9]

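Why you can't loop and delete at the same time:

Imagine that Python uses an internal counter to remember the index of the current item in the for-loop.

When the counter equals 0 (the first time through the loop), Python executes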

output.remove(item)

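Fine. There is now one fewer item in output. But then Python increments the counter to 1, so the next value of item is output[1], which is the third item of the original list. The item that has just moved into position 0 is never visited.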

0  <-- first item removed
1  <-- the new output[0] ** THIS ONE GETS SKIPPED **
2  <-- the new output[1] -- gets removed on the next iteration 
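To make the counter explanation concrete, here is a minimal sketch that imitates the for-loop with an explicit index. The names `index` and `removed` are only illustrative, and the real for-loop machinery is an iterator rather than a literal while-loop, but the skipping behaviour is the same:

```python
# Imitate "for item in output: output.remove(item)" with an explicit counter.
output = list(range(10))
removed = []   # illustrative: records which items the loop actually visited
index = 0      # illustrative stand-in for the loop's internal counter

while index < len(output):
    item = output[index]
    output.remove(item)    # the list shrinks and later items shift left by one
    removed.append(item)
    index += 1             # the counter still advances, so the shifted item is skipped

print(removed)  # [0, 2, 4, 6, 8]
print(output)   # [1, 3, 5, 7, 9]
```

Every second item is skipped because each removal shifts the remaining items one position to the left while the counter moves one position to the right.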

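(Workaround) Solution:

Instead, either iterate over a copy of output, or build a new list. In this case I think building a new list is more efficient, because every output.remove(word) call has to search the list again for the word: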

new_output = []
for word in output:
    x = wn.synsets(word) 
    if len(x)>=1:
        new_output.append(word)
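For comparison, here is a rough sketch of both options: iterating over a copy, and building the new list with a list comprehension. So that it runs without the WordNet data, it uses a hypothetical stand-in `has_synsets()` and a made-up `KNOWN` set in place of `len(wn.synsets(word)) >= 1`; neither name comes from the question, and the word lists are invented for illustration.

```python
# Hypothetical stand-in for len(wn.synsets(word)) >= 1; KNOWN is made-up sample data.
KNOWN = {"aardvark", "abacus"}

def has_synsets(word):
    return word in KNOWN

# Option 1: iterate over a copy (output[:]) and remove from the original list.
output = ["aardvark", "aaa", "aab", "abacus"]
for word in output[:]:
    if not has_synsets(word):
        output.remove(word)

# Option 2: build a new list with a list comprehension.
words = ["aardvark", "aaa", "aab", "abacus"]
filtered = [w for w in words if has_synsets(w)]

print(output)    # ['aardvark', 'abacus']
print(filtered)  # ['aardvark', 'abacus']
```

With the real data, the condition in the comprehension would be `wn.synsets(w)`, since an empty list is falsy. Note that "aaa" and "aab" are adjacent; removing from the list while looping over it directly would skip "aab".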