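I have never really worked with NLP, but I had an idea about NER that should not work, and yet in one case it works exceptionally well. I don't understand why it works there, why it fails elsewhere, or whether it can be extended.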
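The idea is to extract the names of the main characters in a story in the following way:
1. For each word, collect the list of words that appear right after it in the text.
2. For each word, find the other word whose list overlaps with it the most (that is, the word that is used most similarly).
3. Given the name of one character, take the capitalized words whose best match is that name.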
I ran my overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
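Although it filters for capitalized words (and receives "Alice" as the word to cluster around), there are about 500 capitalized words to begin with, and the result is still pretty much spot on as far as main characters go.
It gives interesting results, but it does not work as well with other characters or in other stories.
Any idea whether this approach is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!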
# English name recognition
import re
import sys


def mimic_dict(filename):
    """Map each word to the list of words that immediately follow it."""
    followers = {}
    with open(filename) as f:
        text = f.read()
    prev = ""
    for token in text.split():
        # Keep only the first run of word characters in each token.
        m = re.search(r"\w+", token)
        if m is None:
            continue
        word = m.group()
        followers.setdefault(prev, []).append(word)
        prev = word
    return followers


def main():
    if len(sys.argv) != 2:
        print('usage: ./main.py file-to-read')
        sys.exit(1)

    followers = mimic_dict(sys.argv[1])

    # All capitalized words longer than one character.
    upper = [w for w in followers if len(w) > 1 and w[0].isupper()]
    print(len(upper), upper)

    # Hand-picked words to drop before scoring.
    exclude = ["ME", "Yes", "English", "Which", "When", "WOULD", "ONE",
               "THAT", "That", "Here", "and", "And", "it", "It", "me"]
    for s in exclude:
        followers.pop(s, None)

    # Distinct followers per word, so overlaps can be counted quickly.
    follower_sets = {w: set(v) for w, v in followers.items()}

    # For each word, find the other word sharing the most distinct followers.
    scores = {}
    for key1 in follower_sets:
        best = 0
        for key2 in follower_sets:
            if key1 == key2:
                continue
            shared = len(follower_sets[key1] & follower_sets[key2])
            if shared > best:
                best = shared
                scores[key1] = (key2, best)

    # Capitalized words whose best match is "Alice".
    names = [w for w, (match, _) in scores.items()
             if match == "Alice" and w[:1].isupper()]
    print(len(names), names)


if __name__ == '__main__':
    main()
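Judging from the looks of your program and from previous experience with NER, I'd say this "works" because you are not doing a proper evaluation. You found "Hare" where you should have found "March Hare".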
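To give a rough idea of what a proper evaluation could look like, here is a minimal sketch that scores a predicted name list against a hand-made gold list with precision, recall and F1. The gold list below is a made-up, partial example and not a real annotation of the book, and `evaluate` is a hypothetical helper name.

```python
def evaluate(predicted, gold):
    """Return (precision, recall, f1) for exact-match name sets."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical, hand-made gold list (illustrative only, far from complete).
gold = ["White Rabbit", "March Hare", "Mock Turtle", "Queen", "Dodo"]
# A few entries taken from the output shown in the question.
predicted = ["Rabbit", "Hare", "Turtle", "Queen", "Dodo", "Latitude"]

precision, recall, f1 = evaluate(predicted, gold)
print(precision, recall, f1)
```

With exact matching, "Hare" earns no credit for "March Hare", which is the point: partial names count as errors.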
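The difficulty in NER (at least for English) is not finding the names; it is detecting their full extent (the "March Hare" example), detecting them even at the start of a sentence, where every word is capitalized, and classifying them as person/organisation/location/etc.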
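The extent problem can be sketched with a deliberately naive heuristic: group consecutive capitalized tokens into one candidate span and skip the sentence-initial token, since it is capitalized no matter what. This is only an illustration under those assumptions, not how a real NER system works, and `capitalized_spans` is a hypothetical name.

```python
import re


def capitalized_spans(sentence):
    """Group consecutive capitalized tokens into candidate name spans.

    The first token is ignored because sentence-initial capitalization
    carries no information about whether the word is a name.
    """
    tokens = re.findall(r"\w+", sentence)
    spans, current = [], []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[0].isupper():
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


print(capitalized_spans("The March Hare and the Hatter were having tea."))
print(capitalized_spans("Alice was tired."))
```

The first call recovers "March Hare" as one span, but the second returns nothing: the heuristic misses a name that opens a sentence, which is exactly the case that capitalization alone cannot resolve.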
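Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect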
[ORG Microsoft] CEO [PER Steve Ballmer]
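One common way to hand such labelled spans to a tagger is as per-token BIO tags (B- for the first token of an entity, I- for the rest, O for everything else). The sketch below converts the bracket notation above into (token, tag) pairs; `bracket_to_bio` is a hypothetical helper, and the parsing assumes well-formed, non-nested brackets.

```python
import re


def bracket_to_bio(annotated):
    """Convert '[LABEL tok tok] tok' notation into (token, BIO tag) pairs."""
    pairs = []
    # Match either a bracketed entity "[LABEL tok ...]" or a bare token.
    for m in re.finditer(r"\[(\w+) ([^\]]+)\]|(\S+)", annotated):
        label, span, bare = m.groups()
        if bare is not None:
            pairs.append((bare, "O"))
        else:
            for i, tok in enumerate(span.split()):
                prefix = "B-" if i == 0 else "I-"
                pairs.append((tok, prefix + label))
    return pairs


for token, tag in bracket_to_bio("[ORG Microsoft] CEO [PER Steve Ballmer]"):
    print(token, tag)
```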