How do I stem the words in a python list?

Cha*_*gaD 18 python nlp

I have a python list like the one below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Now I need to stem every word in it and get another list. How do I do that?

Gar*_*tty 35

from stemming.porter2 import stem

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

What we are doing here is using a list comprehension to loop over each string in the main list, splitting it into a list of words. We then loop over that word list, stemming each word as we go, and return the new list of stemmed words.

Note that I haven't tried this with stemming installed - I have taken that from the comments and have never used it myself. This is, however, the basic concept for splitting the list into words. Note that this will produce a list of lists of words, keeping the original separation.
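A quick illustrative check of that nested structure (just a sketch, assuming the stemming package is installed; the exact stems depend on the Porter2 implementation):

from stemming.porter2 import stem

documents = ["The EPS user interface management system",
             "Graph minors A survey"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

# Two sentences in, two inner lists of stemmed words out -
# the original sentence separation is preserved.
print(len(documents))
print(documents[0])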

If you don't want that separation, you can do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

This will leave you with one continuous, flat list instead.

If you want to join the words back together at the end, you can do:

documents = [" ".join(sentence) for sentence in documents]

or do it all in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

which keeps the sentence structure, or

documents = " ".join(documents)

which ignores it.
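If it helps to see those pieces together, here is a minimal end-to-end sketch (again assuming the stemming package is installed) that stems every word and rebuilds a single string, discarding the sentence boundaries:

from stemming.porter2 import stem

documents = ["Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# Stem every word across all sentences, then join into one flat string.
flattened = " ".join(stem(word)
                     for sentence in documents
                     for word in sentence.split(" "))
print(flattened)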

  • Is stemming no longer a package in Python 3? (2 upvotes)

Tho*_*mas 8

You might want to look at NLTK (the Natural Language Toolkit). It has a module, nltk.stem, which contains various different stemmers.
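A minimal sketch of the same step using one of those NLTK stemmers (this assumes NLTK is installed; PorterStemmer is just one of the stemmers nltk.stem provides):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

documents = ["Human machine interface for lab abc computer applications",
             "Graph minors A survey"]

# One list of stemmed words per original sentence, as in the answer above.
documents = [[stemmer.stem(word) for word in sentence.split(" ")]
             for sentence in documents]
print(documents)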

See also this question.