Abt*_*Pst 5 python nlp nltk pos-tagger stanford-nlp
I am trying to extract keywords from a piece of text using StanfordNERTagger and nltk.
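# Python 2 code; the Stanford jars and models must be on the CLASSPATH.
import re
from nltk.corpus import stopwords
from nltk.tag import StanfordNERTagger, StanfordPOSTagger

docText = "John Donk works for POI. Brian Jones wants to meet with Xyz Corp. for measuring POI's Short Term performance Metrics."
words = re.split(r"\W+", docText)
stops = set(stopwords.words("english"))
# remove stop words (and tokens of one or two characters) from the list
words = [w for w in words if w not in stops and len(w) > 2]
str = " ".join(words)  # note: this name shadows the built-in str
print str
stn = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
# keep only the words that the POS tagger labels as proper nouns
stanfordPosTagList = [word for word, pos in stp.tag(str.split()) if pos == 'NNP']
print "Stanford POS Tagged"
print stanfordPosTagList
# run NER on the proper-noun candidates only
tagged = stn.tag(stanfordPosTagList)
print tagged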
This gives me:
John Donk works POI Brian Jones wants meet Xyz Corp measuring POI Short Term performance Metrics
Stanford POS Tagged
[u'John', u'Donk', u'POI', u'Brian', u'Jones', u'Xyz', u'Corp', u'POI', u'Short', u'Term']
[(u'John', u'PERSON'), (u'Donk', u'PERSON'), (u'POI', u'ORGANIZATION'), (u'Brian', u'ORGANIZATION'), (u'Jones', u'ORGANIZATION'), (u'Xyz', u'ORGANIZATION'), (u'Corp', u'ORGANIZATION'), (u'POI', u'O'), (u'Short', u'O'), (u'Term', u'O')]
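Clearly, words like Short and Term have been tagged as NNP. My data contains many instances where non-NNP words are capitalized. This may be due to typos, or they may be headings. I don't have much control over that.
How can I parse or clean the data so that I can detect a non-NNP term even when it is capitalized? I don't want terms like Short and Term to be classified as NNP.
Also, I'm not sure why John Donk was captured as a person but Brian Jones was not. Could it be due to the other capitalized non-NNP words in my data? Could that affect how StanfordNERTagger treats everything else?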
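Update: one possible solution
Here is what I plan to do: lowercase each word and tag the lowercased word. If the tag is still NNP, then we know the original word must also be an NNP. This is what I tried: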
str = " ".join(words)
print str
stp = StanfordPOSTagger('english-bidirectional-distsim.tagger')
for word in str.split():
    wl = word.lower()
    print wl
    w,pos = stp.tag(wl)
    print pos
    if pos=="NNP":
        print "Got NNP"
        print w
But this gives me an error:
John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics
john
Traceback (most recent call last):
  File "X:\crp.py", line 37, in <module>
    w,pos = stp.tag(wl)
ValueError: too many values to unpack
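I have tried several approaches, but there is always some error. How can I tag a single word?
I don't want to convert the whole string to lowercase and then tag it. If I do that, StanfordPOSTagger returns an empty string.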
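First, see your other question, which covers setting up Stanford CoreNLP so it can be called from the command line or from Python: "nltk: How to prevent stemming of proper nouns".
For the properly cased sentence, we see that NER works as expected: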
>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner', 'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O
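Since the goal is keyword extraction, it may help to merge adjacent tokens that share an NER label into a single entity. The following is only a minimal sketch: the token dicts are copied from the first ten rows of the output above (keeping just the `word` and `ner` keys), and `group_entities` is a hypothetical helper name, not part of CoreNLP or NLTK.

```python
from itertools import groupby

# Token dicts copied from the first sentence's output above; only the
# 'word' and 'ner' keys are kept for this illustration.
tokens = [
    {'word': 'John', 'ner': 'PERSON'},
    {'word': 'Donk', 'ner': 'PERSON'},
    {'word': 'works', 'ner': 'O'},
    {'word': 'POI', 'ner': 'ORGANIZATION'},
    {'word': 'Jones', 'ner': 'ORGANIZATION'},
    {'word': 'wants', 'ner': 'O'},
    {'word': 'meet', 'ner': 'O'},
    {'word': 'Xyz', 'ner': 'ORGANIZATION'},
    {'word': 'Corp', 'ner': 'ORGANIZATION'},
    {'word': 'measuring', 'ner': 'O'},
]


def group_entities(tokens):
    """Merge runs of adjacent tokens with the same non-'O' NER label.

    Hypothetical helper: returns a list of (entity_text, label) pairs.
    """
    entities = []
    for label, run in groupby(tokens, key=lambda t: t['ner']):
        if label != 'O':
            entities.append((' '.join(t['word'] for t in run), label))
    return entities


entities = group_entities(tokens)
print(entities)
```

With this input, `POI` and `Jones` end up in the same run because both carry the ORGANIZATION label and sit next to each other in the stop-word-stripped text.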
For the lowercased sentence, you get neither NNP POS tags nor any NER tags:
>>> for token in annotated_sent1['tokens']:
... print token['word'], token['lemma'], token['pos'], token['ner']
...
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
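As for the `ValueError` in the question: NLTK taggers expect `tag()` to receive a list of tokens and return a list of `(token, tag)` pairs, so a bare string is presumably treated as a sequence of characters. The sketch below uses `fake_tag`, a hypothetical stand-in with the same input/output shape, so it runs without the Stanford jars; it is not the real tagger and always answers `NN`.

```python
def fake_tag(tokens):
    """Hypothetical stand-in for stp.tag(): one (token, tag) pair per item."""
    return [(tok, 'NN') for tok in tokens]


# Passing a bare string: it is iterated character by character, so
# `w, pos = fake_tag('john')` would fail with "too many values to unpack".
pairs_from_string = fake_tag('john')
print(len(pairs_from_string))

# Passing a one-element list gives exactly one pair, which unpacks cleanly.
[(w, pos)] = fake_tag(['john'])
print(w)

# With the real tagger the equivalent call would be (assumption, untested here):
#   [(w, pos)] = stp.tag([wl])
```

Even with the call fixed, the lowercased output above shows `john` tagged as NN rather than NNP, so tagging lowercased words one at a time is unlikely to recover the proper nouns.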
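So the questions you should be asking are:
Once those are answered, you can move on to decide what you really want to do with the NER tags, i.e.
If the input is lowercased because of the way you structured your NLP toolchain, then:
If the input is lowercased because that is how the original data came, then:
If the input has erroneous casing, e.g. `Some big and Some Small but not all are Proper Nouns`, then: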