sam*_*ara 9 python nlp named-entity-recognition nltk stanford-nlp
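First, I tokenize the file content into sentences and then call Stanford NER on each sentence. This process is very slow. I know it would be faster to call the tagger once on the whole file content, but I call it per sentence because I want to index each sentence before and after NE recognition.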
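from nltk.tag.stanford import NERTagger  # named StanfordNERTagger in later NLTK releases
from nltk.tokenize import sent_tokenize, word_tokenize

st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
for filename in filelist:
    with open(filename) as f:
        filecontent = f.read()
    sentences = sent_tokenize(filecontent)  # break file content into sentences
    for j, sent in enumerate(sentences):
        words = word_tokenize(sent)  # tokenize the sentence into words
        ne_tags = st.tag(words)  # get tagged NEs from Stanford NER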
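The slowness is probably caused by calling st.tag() once per sentence, but is there any way to make this run faster?
EDIT
The reason I want to tag the sentences separately is that I want to write them to a file (like a sentence index), so that given a tagged sentence at a later stage I can retrieve the unprocessed sentence (I am also lemmatizing here).
File format: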
(sent_number,orig_sentence,NE_and_lemmatized_sentence)
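The index format above could be written and read back along the following lines. This is only a sketch: the tab-separated layout, the helper names `write_index` and `load_index`, and the `word/TAG` encoding of the processed sentence are all assumptions made for illustration, and the sample rows are invented.

```python
import csv
import io


def write_index(rows, handle):
    """Write (sent_number, orig_sentence, processed_sentence) rows as TSV."""
    writer = csv.writer(handle, delimiter="\t")
    for sent_number, orig_sentence, processed in rows:
        writer.writerow([sent_number, orig_sentence, processed])


def load_index(handle):
    """Map each sentence number to its (original, processed) pair."""
    reader = csv.reader(handle, delimiter="\t")
    return {int(n): (orig, processed) for n, orig, processed in reader}


# Invented sample rows; the processed column format is hypothetical.
rows = [
    (0, "Alice went to the museums.", "Alice/PERSON go to the museum ."),
    (1, "It was raining.", "It be rain ."),
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_index(rows, buffer)
buffer.seek(0)
index = load_index(buffer)
print(index[0][0])  # the unprocessed sentence for sentence number 0
```

Looking a sentence up by number then recovers the original text regardless of how the tagged and lemmatized form was produced.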
StanfordNERTagger has a tag_sents() function that tags a list of tokenized sentences in a single call; see https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L68
>>> st = NERTagger('stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz', 'stanford-ner/stanford-ner.jar')
>>> # tag_sents() expects one flat list of token lists, so build it from one file's content at a time
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(filecontent)]
>>> st.tag_sents(tokenized_sents)
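Batch tagging does not have to cost you the per-sentence index: a batch tagger such as tag_sents() is expected to return one tagged list per input sentence, in order, so the results can be paired back up with the original sentences. The sketch below illustrates that idea. `build_sentence_index` is a hypothetical helper, and `fake_tag_sents` is a stand-in used only so the example runs without the Stanford jar; it is not Stanford NER.

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (word, NE tag)


def build_sentence_index(
    sentences: List[str],
    tokenize: Callable[[str], List[str]],
    tag_sents: Callable[[List[List[str]]], List[List[Token]]],
) -> List[Tuple[int, str, List[Token]]]:
    """Tag all sentences in one batch, then pair each result with its source."""
    tokenized = [tokenize(s) for s in sentences]
    tagged = tag_sents(tokenized)  # a single call instead of one per sentence
    if len(tagged) != len(sentences):
        raise ValueError("tagger returned a different number of sentences")
    return [(i, s, t) for i, (s, t) in enumerate(zip(sentences, tagged))]


# Stand-in tagger for demonstration only; it is NOT Stanford NER.
def fake_tag_sents(token_lists):
    return [
        [(w, "PERSON" if w == "Alice" else "O") for w in words]
        for words in token_lists
    ]


index = build_sentence_index(
    ["Alice went home .", "It rained ."], str.split, fake_tag_sents
)
for row in index:
    print(row)
```

With NLTK, you would presumably pass `word_tokenize` as `tokenize` and `st.tag_sents` as `tag_sents`; the length check guards against the tagger returning a different number of sentences than it was given.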
小智 6
You can use the Stanford NER server. It will be much faster.
Install sner
pip install sner
Run the server
cd your_stanford_ner_dir
java -Djava.ext.dirs=./lib -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -port 9199 -loadClassifier ./classifiers/english.all.3class.distsim.crf.ser.gz
from sner import Ner
test_string = "Alice went to the Museum of Natural History."
tagger = Ner(host='localhost',port=9199)
print(tagger.get_entities(test_string))
The result of this code is
[('Alice', 'PERSON'),
('went', 'O'),
('to', 'O'),
('the', 'O'),
('Museum', 'ORGANIZATION'),
('of', 'ORGANIZATION'),
('Natural', 'ORGANIZATION'),
('History', 'ORGANIZATION'),
('.', 'O')]
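The output above is tagged token by token, so a multi-word name such as "Museum of Natural History" arrives as four separate ORGANIZATION tokens. If whole entities are wanted, consecutive tokens with the same tag can be merged. The sketch below shows one way to do that; `group_entities` is a hypothetical helper, not part of sner or NLTK. Because the 3-class tags carry no begin/inside markers, two distinct entities of the same type that sit directly next to each other would be merged into one.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside="O"):
    """Merge consecutive tokens sharing the same non-O tag into one entity."""
    entities = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(word for word, _ in group), tag))
    return entities


tagged = [
    ('Alice', 'PERSON'),
    ('went', 'O'),
    ('to', 'O'),
    ('the', 'O'),
    ('Museum', 'ORGANIZATION'),
    ('of', 'ORGANIZATION'),
    ('Natural', 'ORGANIZATION'),
    ('History', 'ORGANIZATION'),
    ('.', 'O'),
]

entities = group_entities(tagged)
print(entities)
# [('Alice', 'PERSON'), ('Museum of Natural History', 'ORGANIZATION')]
```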
For more details, see https://github.com/caihaoyu/sner