How to speed up NE recognition with Stanford NER using Python NLTK

sam*_*ara 9 python nlp named-entity-recognition nltk stanford-nlp

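First, I tokenize the file content into sentences, then call Stanford NER on each sentence. This process is very slow. I know it would be faster to call it once on the whole file content, but I call it on each sentence because I want to index every sentence before and after NE recognition.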

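from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag.stanford import NERTagger  # renamed StanfordNERTagger in later NLTK releases

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()  # read the whole file as one string
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize the sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER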

This is probably because st.tag() is called for every sentence, but is there any way to make it run faster?

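EDIT

The reason I want to tag sentences separately is that I want to write them to a file (as a sentence index), so that, given a tagged sentence at a later stage, I can retrieve the unprocessed sentence (I am also lemmatizing here).

File format: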

(sent_number,orig_sentence,NE_and_lemmatized_sentence)
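
For illustration, one record per sentence in this format could be written roughly as follows. This is only a sketch: the helper name `write_index`, the tab-separated layout, and the `word/TAG` rendering are assumptions, and lemmatization is left out.

```python
import csv
import io


def write_index(rows, out):
    """Write (sent_number, orig_sentence, tagged_tokens) rows as tab-separated lines."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for sent_number, orig_sentence, tagged in rows:
        # Render tagged tokens as word/TAG (an assumed rendering, not a fixed format).
        rendered = " ".join(f"{word}/{tag}" for word, tag in tagged)
        writer.writerow([sent_number, orig_sentence, rendered])


# Example with hand-written tags standing in for real tagger output.
buffer = io.StringIO()
write_index(
    [(0, "Alice went home.", [("Alice", "PERSON"), ("went", "O"), ("home", "O"), (".", "O")])],
    buffer,
)
print(buffer.getvalue())
```

Because the sentence number and the original sentence are stored next to the tagged form, the untouched sentence can be looked up later from either column.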

alv*_*vas 8

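StanfordNERTagger (called NERTagger in older NLTK releases) has a tag_sents() function that tags a list of tokenized sentences in one call; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68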

>>> st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> # tag_sents() expects a list of token lists, one per sentence
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
>>> st.tag_sents(tokenized_sents)
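
Since tag_sents() returns one list of (token, tag) pairs per input sentence, in the same order, the sentence numbering can be kept while still tagging in a single batch. A minimal sketch, where the helper name `build_sentence_index` and its parameters are made up for illustration and `tagger` is assumed to be any object with a `tag_sents()` method:

```python
def build_sentence_index(sentences, tokenize, tagger):
    """Tag all sentences in one batch and pair each result with its source sentence.

    sentences: list of raw sentence strings
    tokenize:  callable turning a sentence into a list of tokens
    tagger:    object with a tag_sents(list_of_token_lists) method
    """
    tokenized = [tokenize(sentence) for sentence in sentences]
    tagged = tagger.tag_sents(tokenized)  # one call for the whole batch
    # zip() relies on tag_sents() preserving sentence order.
    return [
        (number, sentence, tags)
        for number, (sentence, tags) in enumerate(zip(sentences, tagged))
    ]
```

With NLTK this might be called as `build_sentence_index(sent_tokenize(filecontent), word_tokenize, st)`, giving (sent_number, orig_sentence, tagged_tokens) tuples without calling st.tag() once per sentence.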


小智 6

You can use the Stanford NER server. It will be much faster.

Install sner

pip install sner

Run the server

cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz

from sner import Ner

test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost',port=9199)
print(tagger.get_entities(test_string))

This code outputs:

[('Alice', 'PERSON'),
 ('went', 'O'),
 ('to', 'O'),
 ('the', 'O'),
 ('Museum', 'ORGANIZATION'),
 ('of', 'ORGANIZATION'),
 ('Natural', 'ORGANIZATION'),
 ('History', 'ORGANIZATION'),
 ('.', 'O')]
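
The output is tagged token by token, so a multi-word name such as "Museum of Natural History" comes back as four separate ORGANIZATION tokens. If whole entities are needed, one possible post-processing step is to merge adjacent tokens that share a tag. This is a sketch, not part of sner; `group_entities` is a made-up helper name.

```python
from itertools import groupby


def group_entities(tagged_tokens):
    """Merge runs of adjacent tokens sharing a non-'O' tag into (phrase, tag) pairs."""
    entities = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != "O":
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


tagged = [
    ("Alice", "PERSON"), ("went", "O"), ("to", "O"), ("the", "O"),
    ("Museum", "ORGANIZATION"), ("of", "ORGANIZATION"),
    ("Natural", "ORGANIZATION"), ("History", "ORGANIZATION"), (".", "O"),
]
print(group_entities(tagged))
# [('Alice', 'PERSON'), ('Museum of Natural History', 'ORGANIZATION')]
```

Note that this simple grouping also merges two distinct entities of the same type when they stand directly next to each other.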

For more details, see https://github.com/caihaoyu/sner
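
Because the classifier stays loaded in the server process, tagging one sentence at a time remains practical, which fits the per-sentence indexing in the question. A minimal sketch, assuming `get_entities` is any callable that takes a sentence string and returns (token, tag) pairs; the helper name `tag_sentences` is made up for illustration.

```python
def tag_sentences(sentences, get_entities):
    """Yield (sent_number, orig_sentence, tagged_tokens), one tagger call per sentence."""
    for number, sentence in enumerate(sentences):
        yield number, sentence, get_entities(sentence)
```

With the tagger from the example above, this could be called as `tag_sentences(sent_tokenize(filecontent), tagger.get_entities)`, where `sent_tokenize` comes from NLTK.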