psr*_*psr 1 python nlp classification nltk
I am trying to do POS tagging using ClassifierBasedPOSTagger with classifier_builder=MaxentClassifier.train. Here is the code:
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
print(me_tagger.evaluate(test_sents))
But after running the code for an hour, I found that it is still initializing ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train). In the output I can see the following:
==> Training (100 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -5.35659 0.007
2 -0.85922 0.953
3 -0.56125 0.986
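I think the iterations will go up to 100 before the classifier is ready to POS-tag any input, and I guess that will take a whole day. Why does it take so much time? Would reducing the number of iterations make this code somewhat practical (meaning less time while still being useful enough), and if so, how do I reduce the iterations?
EDIT
After 1.5 hours, I got the following output: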
==> Training (100 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -5.35659 0.007
2 -0.85922 0.953
3 -0.56125 0.986
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1310: RuntimeWarning: overflow encountered in power
exp_nf_delta = 2 ** nf_delta
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1312: RuntimeWarning: invalid value encountered in multiply
sum1 = numpy.sum(exp_nf_delta * A, axis=0)
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1313: RuntimeWarning: invalid value encountered in multiply
sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
Final nan 0.991
0.892155885577594
Was the algorithm supposed to reach the 100 iterations specified in the first line of the output, and did it fall short because of the errors? Is there any way to reduce the training time?
You can set the max_iter parameter to whatever number of iterations you want.
Code:
from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
# Change size based on your requirement
size = int(len(brown_tagged_sents) * 0.05)
print("size:",size)
train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]
#me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, max_iter=15))
print(me_tagger.evaluate(test_sents))
Output:
('size:', 231)
==> Training (15 iterations)
Iteration Log Likelihood Accuracy
---------------------------------------
1 -4.67283 0.013
2 -0.89282 0.964
3 -0.56137 0.998
4 -0.40573 0.999
5 -0.31761 0.999
6 -0.26107 0.999
7 -0.22175 0.999
8 -0.19284 0.999
9 -0.17067 0.999
10 -0.15315 0.999
11 -0.13894 0.999
12 -0.12719 0.999
13 -0.11730 0.999
14 -0.10887 0.999
Final -0.10159 0.999
0.787489765499
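As a variation on the lambda above, the same cap can be written with functools.partial, and a second cutoff can be added alongside max_iter. This is only a minimal sketch: the three training sentences are made-up data so that it finishes quickly, the name quick_maxent is hypothetical, and the values 5 and 0.01 are arbitrary choices rather than recommendations. It assumes nltk and numpy are installed.

```python
from functools import partial

from nltk.classify import MaxentClassifier
from nltk.tag.sequential import ClassifierBasedPOSTagger

# Tiny hand-made training set (hypothetical data, only to keep the sketch fast).
train_sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "AT"), ("bird", "NN"), ("sings", "VBZ")],
]

# partial() fixes the keyword arguments and leaves a one-argument callable,
# which is the shape classifier_builder expects.
quick_maxent = partial(
    MaxentClassifier.train,
    max_iter=5,        # hard cap on the number of iterations
    min_lldelta=0.01,  # also stop once an iteration improves log likelihood by less than this
    trace=0,           # silence the per-iteration table
)

me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=quick_maxent)

tagged = me_tagger.tag(["the", "dog", "sings"])
print(tagged)
```

In the output above, the log likelihood gains shrink with every iteration while accuracy stays at 0.999, which is the situation a min_lldelta cutoff is meant to catch.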
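Regarding the edit:
These messages are runtime warnings, not errors.
On the fourth iteration the Log Likelihood came out as nan, so training stopped there. That iteration therefore became the final one.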
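To see how warnings like these can end in nan without raising an exception, here is a small NumPy sketch. The arrays are hypothetical values chosen only to trigger the same kind of warnings. They are not taken from NLTK's internals, and this is one plausible mechanism rather than a trace of what maxent.py actually computed.

```python
import numpy as np

# Hypothetical values, chosen only to reproduce the warning types.
nf_delta = np.array([10.0, 5000.0])  # one entry has grown very large
A = np.array([1.0, 0.0])             # a zero entry lines up with it

# errstate() suppresses the RuntimeWarnings here; without it NumPy would
# print "overflow encountered in power" and "invalid value encountered in multiply".
with np.errstate(over="ignore", invalid="ignore"):
    exp_nf_delta = 2 ** nf_delta     # 2**5000 does not fit in float64, so it becomes inf
    product = exp_nf_delta * A       # inf * 0.0 is undefined, so it becomes nan
    total = np.sum(product)          # nan propagates through the sum

print(exp_nf_delta, product, total)
```

Because NumPy only warns in these cases, the computation carries on with nan values, which matches the "Final nan" row in the question's output.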