I am using NLTK's ne_chunk to extract named entities from a text:
my_sent = "WASHINGTON -- In the wake of a string of abuses by New York police officers in the 1990s, Loretta E. Lynch, the top federal prosecutor in Brooklyn, spoke forcefully about the pain of a broken trust that African-Americans felt and said the responsibility for repairing generations of miscommunication and mistrust fell to law enforcement."
# ne_chunk expects POS-tagged tokens, not a raw string
nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(my_sent)), binary=True)
But I can't figure out how to save these entities to a list. For example:
print(Entity_list)
('WASHINGTON', 'New York', 'Loretta', 'Brooklyn', 'African')
Thanks.
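One way to collect the entities, sketched under a few assumptions: with binary=True the chunker labels every entity subtree NE, and everything outside an entity stays a plain (word, tag) tuple. The helper name extract_entities and the hand-built tree are made up for illustration, so the snippet runs without downloading any NLTK models.

```python
from nltk.tree import Tree


def extract_entities(chunked):
    """Return the text of every NE chunk found directly under the sentence node."""
    entities = []
    for node in chunked:
        # Tokens outside an entity are plain (word, tag) tuples; entities are Tree nodes.
        if isinstance(node, Tree) and node.label() == "NE":
            entities.append(" ".join(word for word, tag in node.leaves()))
    return entities


# Hand-built stand-in for ne_chunk output (an assumption, not real chunker output).
chunked = Tree("S", [
    Tree("NE", [("WASHINGTON", "NNP")]),
    ("--", ":"),
    ("In", "IN"),
    Tree("NE", [("New", "NNP"), ("York", "NNP")]),
    ("police", "NN"),
    Tree("NE", [("Loretta", "NNP")]),
    ("spoke", "VBD"),
])

entity_list = extract_entities(chunked)
print(entity_list)
```

On real input the same helper should accept the tree returned by nltk.ne_chunk for the tagged tokens of my_sent, provided the tokenizer, tagger and chunker models are installed.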
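I have a machine learning task involving a large amount of text data. I want to identify and extract noun phrases in the training text so that I can use them for feature construction later in the pipeline. I have extracted the type of noun phrases I wanted from the text, but I am fairly new to NLTK, so I approached the problem in a way that lets me break down each step of the list comprehensions, as shown below.
But my real question is: am I reinventing the wheel here? Is there a faster way to do this that I am not seeing?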
import nltk
import pandas as pd
# Raw string, so "\U" and "\t" in the path are not read as escape sequences
myData = pd.read_excel(r"\User\train_.xlsx")
texts = myData['message']
# Defining a grammar & Parser (raw string keeps the regex backslashes intact)
NP = r"NP: {(<V\w+>|<NN\w?>)+.*<NN\w?>}"
chunkr = nltk.RegexpParser(NP)
tokens = [nltk.word_tokenize(i) for i in texts]
tag_list = [nltk.pos_tag(w) for w in tokens]
phrases = [chunkr.parse(sublist) for sublist in tag_list]
# label is a method in NLTK 3, so it must be called; comparing t.label itself is always False
leaves = [[subtree.leaves() for subtree in tree.subtrees(filter = lambda t: t.label() == 'NP')] for tree in phrases]
Flatten the list of lists of lists of tuples that we end up with into just a list of lists of tuples
leaves = [tupls for sublists in leaves for tupls in sublists]
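For comparison, the flattening step can also be written with itertools.chain.from_iterable from the standard library, which removes exactly one level of nesting. This is only a sketch: the nested list below is made-up toy data standing in for leaves, not output from the real spreadsheet.

```python
from itertools import chain

# Toy stand-in (assumption) for `leaves`: one inner list per text,
# each holding one list of (word, tag) tuples per noun phrase.
nested = [
    [[("text", "NN"), ("data", "NNS")], [("noun", "NN"), ("phrases", "NNS")]],
    [],  # a text in which the grammar matched nothing
    [[("feature", "NN"), ("building", "NN")]],
]

# Drop the per-text level so that each element is a single noun phrase.
flat = list(chain.from_iterable(nested))
print(flat)
```

Both spellings give the same result; the choice is mostly a matter of readability.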
Join the extracted terms into a bigram
nounphrases = …
Given a bracketed parse, I can convert it into a Tree object in NLTK:
>>> from nltk.tree import Tree
>>> s = '(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))'
>>> Tree.fromstring(s)
Tree('ROOT', [Tree('S', [Tree('NP', [Tree('NNP', ['Europe'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('PP', [Tree('IN', ['in']), Tree('NP', [Tree('DT', ['the']), Tree('JJ', ['same']), Tree('NNS', ['trends'])])])]), Tree('.', ['.'])])])
But when I try to iterate over it, I can only access the topmost subtree:
>>> for i in Tree.fromstring(s):
...     print(i)
...
(S
  (NP (NNP Europe))
  (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends))))
  (. .)) …
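Iterating over a Tree only yields its direct children, which is why the loop above stops at S. A minimal sketch of walking the whole structure, assuming the subtrees() method of NLTK 3 (it yields the tree itself and then every nested subtree, depth-first); the variable names are illustrative.

```python
from nltk.tree import Tree

s = '(ROOT (S (NP (NNP Europe)) (VP (VBZ is) (PP (IN in) (NP (DT the) (JJ same) (NNS trends)))) (. .)))'
tree = Tree.fromstring(s)

# Visit every node, not just the direct children of ROOT.
labels = [sub.label() for sub in tree.subtrees()]
print(labels)

# An optional filter restricts the walk, here to noun phrases only.
noun_phrases = [" ".join(sub.leaves()) for sub in tree.subtrees(lambda t: t.label() == "NP")]
print(noun_phrases)
```

A hand-written recursive function over the children would work as well; subtrees() just saves writing the recursion.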