use*_*859 24 python named-entity-recognition nltk stanford-nlp
I am trying to extract a list of persons and organizations using the Stanford Named Entity Recognizer (NER) in Python NLTK. When I run:
from nltk.tag.stanford import NERTagger
st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
               '/usr/share/stanford-ner/stanford-ner.jar')
r=st.tag('Rami Eid is studying at Stony Brook University in NY'.split())
print(r)
The output is:
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
What I want is to extract all persons and organizations from this list in this form:
Rami Eid
Stony Brook University
I tried to loop through the list of tuples:
for x, y in r:
    if y == 'ORGANIZATION':
        print(x)
But this code only prints one word of the entity per line:
Stony
Brook
University
With real data there can be more than one organization or person in a single sentence. How can I put the boundaries between different entities?
ale*_*xis 28
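Thanks to the link discovered by @Vaulstein, it is clear that the trained Stanford tagger, as distributed (at least in 2012), does not chunk named entities. From the accepted answer:
"Many NER systems use more complex labels such as IOB labels, where codes like B-PERS indicate where a person entity starts. The CRFClassifier class and feature factories support such labels, but they are not used in the models we currently distribute (as of 2012)."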
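You have the following options:
1. Collect runs of identically tagged words; e.g., all adjacent words tagged PERSON should be taken together as one named entity. That is very easy, but of course it will sometimes combine different named entities. (E.g. New York, Boston [and] Baltimore is about three cities, not one.) Edit: This is what Alvas's code does in the accepted answer. See below for a simpler implementation.
2. Use nltk.ne_chunk(). It doesn't use the Stanford recognizer, but it does chunk entities. (It is a wrapper around an IOB named entity tagger.)
3. Figure out a way to do your own chunking on top of the results that the Stanford tagger returns.
4. Train your own IOB named entity chunker (using the Stanford tools or NLTK's framework) for the domain you are interested in. If you have the time and resources to do this right, it will probably give you the best results.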
Edit: If all you want is to pull out runs of consecutive named entities (option 1 above), you should use itertools.groupby:
from itertools import groupby
# groupby yields one (tag, run) pair per maximal run of identically tagged words.
for tag, chunk in groupby(netagged_words, lambda x: x[1]):
    if tag != "O":
        print("%-12s" % tag, " ".join(w for w, t in chunk))
If netagged_words is the list of (word, type) tuples in your question, this produces:
PERSON       Rami Eid
ORGANIZATION Stony Brook University
LOCATION     NY
Note again that if two named entities of the same type occur right next to each other, this approach will combine them. E.g. New York, Boston [and] Baltimore is about three cities, not one.
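To get back to what the question asks for (separate lists of people and organizations), the groupby runs can be collected into a dictionary keyed by entity type. This is only a minimal sketch and not part of the original answer: the helper name `entities_by_type` is made up for illustration, and the sample list is copied from the question rather than produced by a live tagger.

```python
from collections import defaultdict
from itertools import groupby


def entities_by_type(tagged_words, outside_tag="O"):
    """Group runs of identically tagged words into {tag: [entity, ...]}."""
    entities = defaultdict(list)
    for tag, run in groupby(tagged_words, key=lambda pair: pair[1]):
        if tag != outside_tag:
            entities[tag].append(" ".join(word for word, _ in run))
    return dict(entities)


# Sample tagger output copied from the question (hard-coded for illustration).
netagged_words = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
                  ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
                  ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
                  ('in', 'O'), ('NY', 'LOCATION')]

found = entities_by_type(netagged_words)
print(found.get("PERSON", []))
print(found.get("ORGANIZATION", []))
```

The same caveat applies here: two neighbouring entities of the same type still end up in one string.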
alv*_*vas 24
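IOB/BIO stands for Inside, Outside, Beginning (IOB), sometimes also written as Beginning, Inside, Outside (BIO).
The Stanford NE tagger returns tags in the spirit of IOB/BIO, although without the B-/I- prefixes, e.g.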
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
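In ('Rami', 'PERSON'), ('Eid', 'PERSON'), both tokens are tagged PERSON: "Rami" is the Beginning of an NE chunk and "Eid" is Inside it. Any non-NE token is tagged "O".
The idea of extracting continuous NE chunks is very similar to the question "Named Entity Recognition with Regular Expression: NLTK", but because the Stanford NE chunker API doesn't return a nice tree to parse, you have to do this: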
def get_continuous_chunks(tagged_sent):
    continuous_chunk = []
    current_chunk = []
    for token, tag in tagged_sent:
        if tag != "O":
            current_chunk.append((token, tag))
        else:
            if current_chunk:  # if the current chunk is not empty
                continuous_chunk.append(current_chunk)
                current_chunk = []
    # Flush the final current_chunk into the continuous_chunk, if any.
    if current_chunk:
        continuous_chunk.append(current_chunk)
    return continuous_chunk
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
named_entities = get_continuous_chunks(ne_tagged_sent)
# Join the tokens of each chunk; the chunk's label is the tag of its first token.
named_entities_str = [" ".join(token for token, tag in ne) for ne in named_entities]
named_entities_str_tag = [(" ".join(token for token, tag in ne), ne[0][1]) for ne in named_entities]
print(named_entities)
print()
print(named_entities_str)
print()
print(named_entities_str_tag)
print()
[OUT]:
[[('Rami', 'PERSON'), ('Eid', 'PERSON')], [('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION')], [('NY', 'LOCATION')]]
['Rami Eid', 'Stony Brook University', 'NY']
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
Note the limitation, though: if two NEs are directly adjacent, the result may be wrong. Still, I can't think of any example where two NEs are adjacent without any "O" between them.
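One case where this can matter is two entities of different types standing side by side, which get_continuous_chunks above returns as a single chunk. The following is a sketch of a variant that also closes the chunk when the tag changes. The name `get_chunks_split_on_tag` is hypothetical, and the tagged sentence is hand-written for illustration, not real tagger output.

```python
def get_chunks_split_on_tag(tagged_sent):
    """Like get_continuous_chunks, but also starts a new chunk when the tag changes."""
    chunks = []
    current_chunk = []
    for token, tag in tagged_sent:
        # Close the open chunk at an "O" token or at a change of entity type.
        if current_chunk and (tag == "O" or tag != current_chunk[-1][1]):
            chunks.append(current_chunk)
            current_chunk = []
        if tag != "O":
            current_chunk.append((token, tag))
    # Flush the last open chunk, if any.
    if current_chunk:
        chunks.append(current_chunk)
    return chunks


# Hand-written tags for illustration (an assumption, not real tagger output).
tagged = [('I', 'O'), ('told', 'O'), ('Mary', 'PERSON'),
          ('Microsoft', 'ORGANIZATION'), ('was', 'O'), ('hiring', 'O')]
chunks = get_chunks_split_on_tag(tagged)
print([(" ".join(tok for tok, _ in chunk), chunk[0][1]) for chunk in chunks])
```

Two adjacent entities of the same type are still merged, because plain tags carry no information about where one ends and the next begins.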
As @alexis suggested, it is better to convert the Stanford NE output into NLTK trees:
from nltk import pos_tag
from nltk.chunk import conlltags2tree
from nltk.tree import Tree
def stanfordNE2BIO(tagged_sent):
    bio_tagged_sent = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O":  # Outside any NE
            bio_tagged_sent.append((token, tag))
            prev_tag = tag
            continue
        if tag != "O" and prev_tag == "O":  # Begin NE
            bio_tagged_sent.append((token, "B-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag == tag:  # Inside NE
            bio_tagged_sent.append((token, "I-" + tag))
            prev_tag = tag
        elif prev_tag != "O" and prev_tag != tag:  # Adjacent NE of a different type
            bio_tagged_sent.append((token, "B-" + tag))
            prev_tag = tag
    return bio_tagged_sent
def stanfordNE2tree(ne_tagged_sent):
    bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
    sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
    sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]
    # conlltags2tree expects (token, POS tag, IOB tag) triples.
    sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
    ne_tree = conlltags2tree(sent_conlltags)
    return ne_tree
ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'),
                  ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
                  ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'),
                  ('in', 'O'), ('NY', 'LOCATION')]
ne_tree = stanfordNE2tree(ne_tagged_sent)
print(ne_tree)
[OUT]:
(S
  (PERSON Rami/NNP Eid/NNP)
  is/VBZ
  studying/VBG
  at/IN
  (ORGANIZATION Stony/NNP Brook/NNP University/NNP)
  in/IN
  (LOCATION NY/NNP))
Then:
ne_in_sent = []
for subtree in ne_tree:
    if isinstance(subtree, Tree):  # Subtrees are NE chunks; plain leaves are "O" tokens
        ne_label = subtree.label()
        ne_string = " ".join(token for token, pos in subtree.leaves())
        ne_in_sent.append((ne_string, ne_label))
print(ne_in_sent)
[OUT]:
[('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
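If the POS tags and the tree itself are not needed, the BIO tags can also be decoded directly, which avoids the call to pos_tag (it relies on a separately installed tagger model). The following is a rough sketch under that assumption: `bio_to_entities` is a hypothetical helper, not an NLTK function, and the input list is what stanfordNE2BIO above would return for the sample sentence, hard-coded here so the example runs without NLTK.

```python
def bio_to_entities(bio_tagged_sent):
    """Collect (entity string, label) pairs from (token, BIO tag) pairs."""
    entities = []
    tokens, label = [], None
    for token, tag in bio_tagged_sent:
        # "B-PERSON" -> ("B", "-", "PERSON"); "O" -> ("O", "", "")
        prefix, _, tag_label = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and tag_label != label)
        # Close the open entity at an "O" token or where a new entity starts.
        if tokens and (tag == "O" or starts_new):
            entities.append((" ".join(tokens), label))
            tokens, label = [], None
        if tag != "O":
            tokens.append(token)
            label = tag_label
    # Flush the last open entity, if any.
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# BIO tags for the sample sentence, hard-coded for illustration.
bio_tagged_sent = [('Rami', 'B-PERSON'), ('Eid', 'I-PERSON'), ('is', 'O'),
                   ('studying', 'O'), ('at', 'O'), ('Stony', 'B-ORGANIZATION'),
                   ('Brook', 'I-ORGANIZATION'), ('University', 'I-ORGANIZATION'),
                   ('in', 'O'), ('NY', 'B-LOCATION')]
entities = bio_to_entities(bio_tagged_sent)
print(entities)
```

The tree route remains the better choice when later steps need POS tags or NLTK's tree utilities.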