Adding the unique words from a text file to a list in Python

Eka*_*234 2 python python-2.7

Suppose I have the following text file:

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

I want to add all the unique words in this file to a list:

fname = open("romeo.txt")
lst = list()
for line in fname:
    line = line.rstrip()
    words = line.split(' ')
    for word in words:
        if word in lst: continue
        lst = lst + words
    lst.sort()
print lst

But the program's output is as follows:

['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 
'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 
'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 
'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 
'window', 'with', 'yonder']

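'and' and several other words appear in the list more than once. Which part of the loop should I change so that I don't get any duplicate words? Thanks!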

mha*_*wke 6

Here are the problems with your code; a corrected version follows below:

fname = open("romeo.txt")      # better to open files in a `with` statement
lst = list()                   # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()       # not required, `split()` will do this anyway
    words = line.split(' ')    # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words      # this is the reason that you end up with duplicates... words is the list of all words for this line!
    lst.sort()                 # don't sort in the for loop, just once afterwards.
print lst

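So it almost works. However, you should append only the current `word` to the list, not all of the `words` that you got from the line with `split()`. If you just changed the line: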

lst = lst + words

to:

lst.append(word)

it would work.
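
To see why the original line produces duplicates, here is a small sketch that runs both loop bodies on a single line of the sample text. The names `buggy` and `fixed` are only for illustration and are not part of the original code:

```python
# One line of the sample text, already split into words.
words = "It is the east and Juliet is the sun".split()

# Original loop body: concatenates the whole line onto the list.
buggy = []
for word in words:
    if word in buggy:
        continue
    buggy = buggy + words   # adds every word on the line, repeats included

# Changed loop body: appends only the word being examined.
fixed = []
for word in words:
    if word in fixed:
        continue
    fixed.append(word)

print(buggy)
print(fixed)
```

Here `buggy` ends up as the entire line, repeats and all: the first unseen word pulls in every word on the line at once, and each later word is then already present, so the `in` check never gets a chance to filter anything.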

Here is a corrected version:

with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)    # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)

As others have suggested, a `set` is a good way to handle this. It is as simple as:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

With `sorted()` you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
    print(unique_words)
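
If you want to try the `set` approach without the file on disk, the following is a minimal sketch assuming Python 3. An `io.StringIO` object stands in for the opened `romeo.txt`; that substitution is an assumption made only for the demonstration:

```python
import io

# Stand-in for open('romeo.txt'); holds the sample text from the question.
infile = io.StringIO(
    "But soft what light through yonder window breaks\n"
    "It is the east and Juliet is the sun\n"
    "Arise fair sun and kill the envious moon\n"
    "Who is already sick and pale with grief\n"
)

# set() drops the repeats, sorted() returns a new sorted list.
unique_words = sorted(set(infile.read().split()))
print(unique_words)
print(len(unique_words))
```

Note that the sort is case-sensitive, so capitalized words such as 'Arise' come before all lowercase words, just as in the output shown in the question.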

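Reading the whole file into memory might not work well for large files. You can use a generator to read the file efficiently without cluttering the main code. This generator reads the file one line at a time and yields one word at a time. It never reads the whole file at once, unless the file consists of a single long line (which your sample data clearly does not):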

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))
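
As a variation, the same line-by-line iteration can probably be written inline as a set comprehension instead of a separate generator function. This is a sketch assuming Python 3, with `io.StringIO` standing in for the opened file; the names `from_generator` and `from_comprehension` are illustrative only:

```python
import io

def get_words(f):
    # Yield one word at a time, reading the file one line at a time.
    for line in f:
        for word in line.split():
            yield word

# Stand-in for open('romeo.txt'); two lines of the sample text.
infile = io.StringIO(
    "It is the east and Juliet is the sun\n"
    "Arise fair sun and kill the envious moon\n"
)

from_generator = sorted(set(get_words(infile)))

infile.seek(0)  # rewind so the second pass sees the same lines
from_comprehension = sorted({word for line in infile for word in line.split()})

print(from_generator == from_comprehension)
```

Both forms hold only one line of input at a time; the set of unique words still has to fit in memory.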