How do I change the number of iterations of the maxent classifier used for POS tagging in NLTK?

psr*_*psr 1 python nlp classification nltk

I am trying to perform POS tagging using ClassifierBasedPOSTagger with classifier_builder=MaxentClassifier.train. Here is the code:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
size = int(len(brown_tagged_sents) * 0.9)

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
print(me_tagger.evaluate(test_sents))

But after running the code for an hour, I see that it is still initializing ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train). In the output, I can see that the following is running:

  ==> Training (100 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -5.35659        0.007
         2          -0.85922        0.953
         3          -0.56125        0.986

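I assume the iteration count has to reach 100 before the classifier is ready to tag any input, and I guess that would take a whole day. Why does it take so long? Would reducing the number of iterations make this code somewhat practical (meaning less training time while still being useful enough)? If so, how do I reduce the number of iterations?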

Edit

After 1.5 hours, I got the following output:

  ==> Training (100 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -5.35659        0.007
         2          -0.85922        0.953
         3          -0.56125        0.986
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1310: RuntimeWarning: overflow encountered in power
  exp_nf_delta = 2 ** nf_delta
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1312: RuntimeWarning: invalid value encountered in multiply
  sum1 = numpy.sum(exp_nf_delta * A, axis=0)
E:\Analytics Practice\Social Media Analytics\analyticsPlatform\lib\site-packages\nltk\classify\maxent.py:1313: RuntimeWarning: invalid value encountered in multiply
  sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
         Final               nan        0.991
0.892155885577594

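Was the algorithm supposed to reach the 100 iterations stated in the first line of the output, but failed to do so because of the errors? Is there any way to reduce the training time?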

RAV*_*AVI 5

You can set the max_iter parameter to the desired number of iterations.

Code:

from nltk.tag.sequential import ClassifierBasedPOSTagger
from nltk.classify import MaxentClassifier
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
# Change size based on your requirement
size = int(len(brown_tagged_sents) * 0.05)
print("size:",size)

train_sents = brown_tagged_sents[:size]
test_sents = brown_tagged_sents[size:]

#me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=MaxentClassifier.train)
me_tagger = ClassifierBasedPOSTagger(train=train_sents, classifier_builder=lambda train_feats: MaxentClassifier.train(train_feats, max_iter=15))
print(me_tagger.evaluate(test_sents))

Output:

('size:', 231)
  ==> Training (15 iterations)

  Iteration    Log Likelihood    Accuracy
  ---------------------------------------
         1          -4.67283        0.013
         2          -0.89282        0.964
         3          -0.56137        0.998
         4          -0.40573        0.999
         5          -0.31761        0.999
         6          -0.26107        0.999
         7          -0.22175        0.999
         8          -0.19284        0.999
         9          -0.17067        0.999
        10          -0.15315        0.999
        11          -0.13894        0.999
        12          -0.12719        0.999
        13          -0.11730        0.999
        14          -0.10887        0.999
     Final          -0.10159        0.999
0.787489765499
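As an alternative to the lambda, the cutoff can also be bound with functools.partial. The sketch below is only an illustration: the toy feature sets and the name quick_maxent are made up for this example, and min_lldelta and trace=0 are optional extras passed alongside max_iter. It trains on a handful of hand-written examples, so it finishes almost immediately and needs no corpus download.

```python
from functools import partial

from nltk.classify import MaxentClassifier

# Hypothetical toy training data: (featureset, label) pairs.
train_toks = [
    ({"suffix": "ing"}, "VBG"),
    ({"suffix": "ed"}, "VBD"),
    ({"suffix": "ly"}, "RB"),
    ({"suffix": "ing"}, "VBG"),
    ({"suffix": "ed"}, "VBD"),
]

# Bind the training cutoffs once. Training stops after at most 5 iterations,
# or earlier if the log likelihood improves by no more than 0.01.
# trace=0 silences the per-iteration table.
quick_maxent = partial(
    MaxentClassifier.train, max_iter=5, min_lldelta=0.01, trace=0
)

classifier = quick_maxent(train_toks)
label = classifier.classify({"suffix": "ing"})
print(label)
```

The same quick_maxent callable could then be passed as classifier_builder=quick_maxent to ClassifierBasedPOSTagger in place of the lambda shown above.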

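Regarding the edit

These messages are runtime warnings, not errors.

After the fourth iteration, it found Log Likelihood = nan, so it stopped further processing. That iteration therefore became the final one.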
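To see how warnings like those in the question can end in nan, here is a small NumPy sketch. The array values are hypothetical, chosen only to trigger the effect; they are not taken from the actual training run. The names nf_delta, exp_nf_delta and A simply mirror the lines quoted in the warning messages.

```python
import warnings

import numpy as np

# Hypothetical values: one moderate exponent and one far too large for float64.
nf_delta = np.array([10.0, 2000.0])
# Hypothetical weights, one of which is exactly zero.
A = np.array([0.5, 0.0])

with warnings.catch_warnings():
    # Silence the same RuntimeWarnings that appear in the question.
    warnings.simplefilter("ignore", RuntimeWarning)
    exp_nf_delta = 2 ** nf_delta   # 2**2000 overflows float64 and becomes inf
    product = exp_nf_delta * A     # inf * 0.0 is undefined and becomes nan
    sum1 = np.sum(product)         # a single nan makes the whole sum nan

print(exp_nf_delta)  # [1024.   inf]
print(sum1)          # nan
```

Once a nan reaches the weights, the log likelihood computed from them is nan as well, which is consistent with the "Final nan" row in the question's output.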