Extract all nouns from a text file using nltk

Rak*_*van 14 python nltk

Is there a more efficient way to do this? My code reads a text file and extracts all the nouns.

import nltk

File = open(fileName) #open file
lines = File.read() #read all lines
sentences = nltk.sent_tokenize(lines) #tokenize sentences
nouns = [] #empty list to hold all nouns

for sentence in sentences:
     for word,pos in nltk.pos_tag(nltk.word_tokenize(str(sentence))):
         if (pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS'):
             nouns.append(word)

How can I reduce the time complexity of this code? Is there a way to avoid the nested for loops?

Thanks in advance!

Azi*_*lto 19

If you are open to options other than NLTK, check out TextBlob. It makes it easy to extract all nouns and noun phrases:

>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']

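  • You can use `blob.tags` to filter for tags such as `NN`, for example `[n for n,t in blob.tags if t == 'NN']`. (2 upvotes)
  • Personally, I find that `TextBlob` performs nowhere near as well as `nltk` (2 upvotes)
  • The code may be simpler, but `textblob` calls NLTK to tokenize and tag. This *cannot* reduce the "time complexity" of the OP's code. (2 upvotes)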
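
As the last comment notes, tokenizing and tagging dominate the cost, and the two nested loops in the question still visit each token only once. Below is a minimal sketch of the same extraction, assuming the standard NLTK tokenizer and tagger data are installed. `nouns_from_tagged` and `extract_nouns` are illustrative names, not part of NLTK. `nltk.pos_tag_sents` is used so that all sentences are tagged in a single batched call instead of one `pos_tag` call per sentence.

```python
NOUN_TAGS = frozenset({"NN", "NNS", "NNP", "NNPS"})  # Penn Treebank noun tags


def nouns_from_tagged(tagged_sentences):
    """Flatten tagged sentences and keep the words whose tag is a noun tag."""
    return [
        word
        for sentence in tagged_sentences
        for word, pos in sentence
        if pos in NOUN_TAGS
    ]


def extract_nouns(path):
    """Read a text file and return its nouns (hypothetical helper)."""
    import nltk  # imported lazily so the filter above works without NLTK

    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    tokenized = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    return nouns_from_tagged(nltk.pos_tag_sents(tokenized))


# The filter can be tried on hand-tagged input without any NLTK data:
sample = [[("lines", "NNS"), ("is", "VBZ"), ("some", "DT"), ("string", "NN")]]
print(nouns_from_tagged(sample))
```

Whether batching helps noticeably depends on the NLTK version and the size of the file, so it is worth measuring on real input.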

Boa*_*Boa 16

import nltk

lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 

print(nouns)
# ['lines', 'string', 'words']

Useful tip: a list comprehension is often a faster way to build a list than adding elements with .insert() or .append() inside a 'for' loop.
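
The tip can be checked rather than taken on faith. Here is a small sketch using the standard-library `timeit` module. The function names and the sample data are made up for illustration, and the numbers depend on the machine and Python version, so none are shown here.

```python
import timeit

# Made-up (word, tag) pairs standing in for the output of a POS tagger.
TAGGED = [("lines", "NNS"), ("is", "VBZ"), ("some", "DT"), ("string", "NN")] * 1000


def with_append(tagged):
    """Build the noun list with an explicit loop and append()."""
    nouns = []
    for word, pos in tagged:
        if pos[:2] == "NN":
            nouns.append(word)
    return nouns


def with_comprehension(tagged):
    """Build the same list with a list comprehension."""
    return [word for word, pos in tagged if pos[:2] == "NN"]


# Both approaches must produce the same list before timing means anything.
assert with_append(TAGGED) == with_comprehension(TAGGED)

for fn in (with_append, with_comprehension):
    seconds = timeit.timeit(lambda: fn(TAGGED), number=200)
    print(f"{fn.__name__}: {seconds:.4f} s for 200 runs")
```

For noun extraction, any difference between the two is likely small next to the cost of tokenizing and tagging.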


Sam*_*Nde 8

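You can get good results using nltk, Textblob, SpaCy or any of the many other libraries out there. These libraries will all do the job, but with different degrees of efficiency.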

import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en')
nlp1 = spacy.load('en_core_web_lg')

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""

I ran some comparisons in a Jupyter notebook on a Windows 10 HP laptop (i5, 2 cores, 4 logical processors, 8 GB RAM). The results are below.

For TextBlob:

%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])

The output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 8.01 ms #average over 20 iterations

For nltk:

%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])

The output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 7.09 ms #average over 20 iterations

For spacy:

%%time
print([ent.text for ent in nlp(txt) if ent.pos_ == 'NOUN'])

The output is

>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 30.19 ms #average over 20 iterations

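It looks like nltk and TextBlob are reasonably fast, which is to be expected since they store no other information about the input text, txt. Spacy is much slower. One more thing: SpaCy missed the noun NLP, while nltk and TextBlob got it. I would go for nltk or TextBlob unless there is something else I want to extract from the input txt.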
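
`%%time` reports a single run, so an average over 20 iterations has to be collected separately. One way to do that outside a notebook is sketched below with the standard library. `average_ms` is a hypothetical helper, and the lambda is a stand-in workload rather than any of the extraction calls timed above.

```python
import time


def average_ms(func, repeats=20):
    """Return the mean wall-clock time of func() in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    return (time.perf_counter() - start) * 1000 / repeats


# Stand-in workload; replace it with e.g. the nltk or TextBlob one-liner.
elapsed = average_ms(lambda: sorted(range(10_000), reverse=True))
print(f"{elapsed:.2f} ms per call")
```

The first call to a tagger usually includes one-time loading of its data, so a warm-up call before measuring may give a fairer average.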


For more, see the spacy quick start, the TextBlob basics and the nltk HowTos.

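  • SpaCy missed NLP because it considers it a proper noun (`PROPN` in spaCy's tag set). SpaCy is slower because it does more, but you can disable the syntactic parser and speed it up considerably. (2 upvotes)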
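
A sketch of what this comment describes, assuming spaCy 3 with the `en_core_web_sm` model installed (the model name is an assumption; the answer used `en` and `en_core_web_lg`). `nouns_from_doc` and `spacy_nouns` are illustrative names. Accepting `PROPN` alongside `NOUN` should pick up tokens such as NLP, and the `disable` argument of `spacy.load` skips pipeline components that are not needed for part-of-speech tags.

```python
from collections import namedtuple

NOUN_POS = {"NOUN", "PROPN"}  # spaCy's coarse tags for common and proper nouns


def nouns_from_doc(doc):
    """Keep the text of every token whose coarse POS tag is a noun tag."""
    return [token.text for token in doc if token.pos_ in NOUN_POS]


def spacy_nouns(text, model="en_core_web_sm"):
    """Tag text with spaCy, skipping the parser and NER (model name assumed)."""
    import spacy  # imported lazily; needs spaCy and the model to be installed

    nlp = spacy.load(model, disable=["parser", "ner"])
    return nouns_from_doc(nlp(text))


# Stand-in tokens so the filter can be tried without loading a model.
Token = namedtuple("Token", ["text", "pos_"])
fake_doc = [Token("NLP", "PROPN"), Token("is", "AUX"), Token("field", "NOUN")]
print(nouns_from_doc(fake_doc))
```

How much time disabling components saves depends on the model and the text, so it is worth timing on real input.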

Ami*_*osh 5

import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if(pos[:2] == 'NN')]
print (nouns)

Just a bit simpler.