What is the difference between NLTK's BLEU score and SacreBLEU?

Question

Sea*_*ala 4 nltk machine-translation bleu

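I'm curious whether anyone is familiar with the difference between NLTK's BLEU score calculation and the SacreBLEU library.

In particular, I am using the sentence-level BLEU score from both libraries, averaged over the entire dataset. The two give different results: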

>>> import sacrebleu
>>> from nltk.translate import bleu_score
>>> from sacrebleu import sentence_bleu
>>> print(len(predictions))
256
>>> print(len(targets))
256
>>> prediction = "this is the first: the world's the world's the world's the \
... world's the world's the world's the world's the world's the world's the world \
... of the world of the world'"
...
>>> target = "al gore: so the alliance for climate change has launched two campaigns."
>>> print(bleu_score.sentence_bleu([target], prediction))
0.05422283394039736
>>> print(sentence_bleu(prediction, [target]).score)
0.0
>>> print(sacrebleu.corpus_bleu(predictions, [targets]).score)
0.678758518214081
>>> print(bleu_score.corpus_bleu([targets], [predictions]))
0

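As you can see, there are a lot of confusing inconsistencies. My BLEU score cannot possibly be 67.8%, but it should not be 0% either (there are plenty of overlapping n-grams, such as "the").

I would appreciate it if anyone could shed some light on this. Thanks.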

Answer 1

Jin*_*ich 6

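NLTK and SacreBLEU use different tokenization rules, mainly in how they handle punctuation. NLTK uses its own tokenization, whereas SacreBLEU replicates the original Perl implementation from 2002. The tokenization rules in NLTK are probably more elaborate, but they make the numbers incomparable with the original implementation.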
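
To make the tokenization point concrete, here is a minimal sketch in plain Python. The `punct_tokenize` function is a hypothetical simplification written for this illustration; it is not the tokenizer that either library actually ships. The last line also shows what happens when a raw string is used where a list of tokens is expected (as in the question's NLTK call): iterating over a string yields single characters, so the n-grams become character n-grams.

```python
import re

def whitespace_tokenize(text):
    # Split on whitespace only; punctuation stays glued to the word.
    return text.split()

def punct_tokenize(text):
    # Hypothetical simplification: runs of word characters become tokens,
    # and every other non-space character becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

target = "al gore: so the alliance for climate change has launched two campaigns."

ws_tokens = whitespace_tokenize(target)
pt_tokens = punct_tokenize(target)

# A raw string passed where a token list is expected is iterated
# character by character.
char_tokens = list(target)

print(ws_tokens)
print(pt_tokens)
print(char_tokens[:7])
```

The same sentence yields "gore:" as one token under whitespace splitting but "gore" and ":" under the punctuation-aware rule, so the n-gram matches (and therefore the BLEU score) differ.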

The corpus BLEU you got from SacreBLEU is not 67.8% but 0.67%: unlike NLTK, the numbers from SacreBLEU are already multiplied by 100. So I would not say there is a huge difference between the scores.
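
As a small sketch of that scale difference, the helpers below convert between the two conventions. The function names are made up for this example; neither library provides them.

```python
def sacrebleu_to_fraction(score):
    # SacreBLEU reports BLEU on a 0-100 scale; divide by 100 for 0-1.
    return score / 100.0

def nltk_to_percent(score):
    # NLTK reports BLEU on a 0-1 scale; multiply by 100 for 0-100.
    return score * 100.0

sacre_corpus_score = 0.678758518214081  # value from the question
as_fraction = sacrebleu_to_fraction(sacre_corpus_score)
print(as_fraction)
```

On the 0-1 scale that NLTK uses, the SacreBLEU corpus score from the question is below 0.01.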

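Sentence-level BLEU can use different smoothing techniques, which should ensure the score gets a reasonable value even when the 3-gram or 4-gram precision is zero. Note, however, that BLEU is very unreliable as a sentence-level metric.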
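
The effect of smoothing can be sketched with a simplified single-reference BLEU in plain Python. This is an illustration only, not the implementation of NLTK or SacreBLEU: the `epsilon` parameter is an assumption of this sketch, a small count substituted for a zero match count, similar in spirit to add-epsilon smoothing schemes.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams of order n in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu_sketch(hyp, ref, max_n=4, epsilon=None):
    # Simplified BLEU against one reference. With epsilon=None, any zero
    # n-gram precision makes the whole score zero.
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0  # hypothesis too short to contain any n-gram of this order
        # Clip each hypothesis n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if clipped == 0:
            if epsilon is None:
                return 0.0
            clipped = epsilon  # assumed smoothing: replace the zero count
        log_sum += math.log(clipped / total) / max_n
    # Brevity penalty: penalize hypotheses that are not longer than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_sum)

hyp = "the cat sat on the mat".split()
ref = "the cat is on the mat".split()

unsmoothed = sentence_bleu_sketch(hyp, ref)
smoothed = sentence_bleu_sketch(hyp, ref, epsilon=0.1)
print(unsmoothed, smoothed)
```

Here the hypothesis shares no 4-gram with the reference, so the unsmoothed score collapses to zero even though most words match, while the smoothed variant stays positive.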

If you need to evaluate a system rather than individual sentences, you should use corpus BLEU.
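
To illustrate why corpus BLEU is not the same as an average of sentence scores, here is a rough single-reference sketch in plain Python (again an illustration under simplifying assumptions, not either library's code). Corpus BLEU pools the clipped n-gram counts and lengths over all sentences before taking the geometric mean, so one sentence without a 4-gram match does not contribute a hard zero.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    # Count all n-grams of order n in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu_sketch(hypotheses, references, max_n=4):
    # Unsmoothed BLEU with one reference per hypothesis; counts are
    # accumulated over the whole corpus before computing precisions.
    clipped = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            clipped[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(clipped) == 0:
        return 0.0  # some n-gram order has no match anywhere in the corpus
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)

hyps = ["the cat sat on the mat".split(), "there is a dog in the garden".split()]
refs = ["the cat is on the mat".split(), "there is a dog in the garden".split()]

# Sentence-level score = the same function applied to a one-sentence corpus.
sentence_scores = [corpus_bleu_sketch([h], [r]) for h, r in zip(hyps, refs)]
average = sum(sentence_scores) / len(sentence_scores)
corpus = corpus_bleu_sketch(hyps, refs)
print(sentence_scores, average, corpus)
```

In this toy example the first sentence scores zero on its own, which drags the average down, whereas the pooled corpus score reflects the matches in both sentences.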
