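Gensim LDA gives a negative log perplexity value - is this normal, and how should I interpret it?

I am currently using Gensim LDA for topic modeling.

While tuning hyperparameters, I found that the model always reports a negative log perplexity.

Is it normal for the model to behave like this? (Is it even possible?)

If so, is a smaller perplexity better than a larger one? (Is -100 better than -20?)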
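
For reference, Gensim's `LdaModel.log_perplexity` returns a per-word likelihood bound rather than a perplexity, and Gensim logs the corresponding perplexity as 2^(-bound). A minimal sketch of that conversion follows; the helper name `perplexity_from_bound` is hypothetical, and the sample values are taken from the question purely for illustration:

```python
def perplexity_from_bound(per_word_bound: float) -> float:
    """Convert a per-word log-likelihood bound (base 2) to a perplexity."""
    return 2.0 ** (-per_word_bound)


# Values from the question, used only as an illustration.
for bound in (-20.0, -100.0):
    print(f"bound={bound:7.1f}  perplexity={perplexity_from_bound(bound):.3e}")
```

Under this convention a bound closer to zero corresponds to a lower perplexity, so -20 would indicate a better fit than -100.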

lda gensim perplexity

now*_*ogo

lucky-day

6
votes

1
answer

958
views

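How to interpret the sklearn LDA perplexity score. Why does it always increase as the number of topics increases?

I am trying to find the optimal number of topics using sklearn's LDA model. To do this, I compute perplexity following the code at https://gist.github.com/tmylk/b71bf7d3ec2f203bfce2.

But when I increase the number of topics, perplexity always increases, which seems unreasonable. Is my implementation wrong, or are these simply the correct values?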

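from __future__ import print_function
from time import time

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

n_samples = 0.7      # fraction of documents used for training
n_features = 1000    # cap on the vectorizer's vocabulary size
n_top_words = 20     # number of words shown per topic
# `kickstarter` is the asker's DataFrame; its definition is not shown.
dataset = kickstarter['short_desc'].tolist()
split = int(len(dataset) * n_samples)
data_samples = dataset[:split]   # training documents
test_samples = dataset[split:]   # held-out documents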

Use tf (raw term count) features for LDA.

print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                max_features=n_features,
                                stop_words='english')
t0 = time()
tf = tf_vectorizer.fit_transform(data_samples)
print("done in %0.3fs." % (time() - t0))
# Use tf (raw term count) features for …

python topic-modeling scikit-learn perplexity

Jon*_*Kim

2017-08-17

5
votes

1
answer

8008
views

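Where is perplexity calculated in the Huggingface gpt2 language model code?

I have seen some GitHub comments saying that the loss returned by the model() call is in the form of perplexity: https://github.com/huggingface/transformers/issues/473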

    if labels is not None:
        # Shift so that tokens < n predict n
        shift_logits = lm_logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        loss_fct = CrossEntropyLoss()
        loss = loss_fct(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        outputs = (loss,) + outputs

    return outputs  # (loss), lm_logits, (all hidden states), (all attentions)

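I can see that cross-entropy is being computed, but it is never converted to perplexity. Where does the loss finally get converted? Or is there already a conversion that I am not understanding?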
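
For context, `CrossEntropyLoss` with its default settings returns the mean negative log-likelihood per token in nats, and perplexity is conventionally the exponential of that quantity. A minimal sketch in plain Python (no PyTorch), with made-up token probabilities for illustration:

```python
import math


def cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities a model assigned to each correct next token.
token_probs = [0.25, 0.5, 0.125, 0.5]

loss = cross_entropy(token_probs)   # what a mean cross-entropy loss reports
perplexity = math.exp(loss)         # conversion applied outside the model

print(f"loss={loss:.4f}  perplexity={perplexity:.4f}")
```

With the snippet quoted in the question, the same conversion would be `math.exp(loss.item())` applied to the returned loss, assuming the loss is averaged over tokens.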

machine-learning gpt perplexity huggingface-transformers

use*_*659

2020-07-01

5
votes

1
answer

2502
views

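Perplexity increases as the number of topics increases

There are many posts about this specific issue, but I have not been able to resolve it. I have been experimenting with LDA on the 20 Newsgroups corpus using the sklearn and Gensim implementations. The literature describes perplexity as generally decreasing as the number of topics increases, but I get different results.

I have tried different parameters, but in general, as the number of topics increases, perplexity on the test set increases while perplexity on the training set decreases. This could indicate that the model is overfitting the training set. However, a similar pattern appears with other text datasets. Moreover, studies that use this exact dataset report decreasing perplexity (e.g. ng20 perplexity).

I have tried sklearn, Gensim, and the Gensim Mallet wrapper. All packages show different perplexity values (which is to be expected, since LDA is randomly initialized and the inference algorithms differ), but the common pattern is that perplexity increases for every package, which contradicts many papers in the literature.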

# imports for code sample
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import LatentDirichletAllocation

Small example code

# retrieve the data
newsgroups_all = datasets.fetch_20newsgroups(subset='all',
                                             remove=('headers', 'footers', 'quotes'),
                                             shuffle=True)
print("Extracting tf features for LDA...")
tf_vectorizer_train = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
X = tf_vectorizer_train.fit_transform(newsgroups_all.data)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

k = N
lda = …
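
A minimal sketch of the comparison described above: fit `LatentDirichletAllocation` for several topic counts and report `perplexity` on both the training and the held-out matrix. The toy corpus and the topic counts are placeholders, not the asker's setup, and numbers from such a small corpus are not meaningful in themselves.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical); substitute real documents in practice.
base_docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "pets need food water and care",
    "the garden has flowers and trees",
    "stocks and bonds are financial assets",
    "investors buy stocks and sell bonds",
    "markets react to interest rate changes",
    "banks set interest rates for loans",
]
docs = base_docs * 3  # repeat so the split has enough rows

X = CountVectorizer().fit_transform(docs)
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

results = {}
for k in (2, 4, 8):
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    lda.fit(X_train)
    # Store (train perplexity, held-out perplexity) for this topic count.
    results[k] = (lda.perplexity(X_train), lda.perplexity(X_test))
    print(f"k={k}  train={results[k][0]:.1f}  test={results[k][1]:.1f}")
```

Printing both columns side by side makes it easier to see whether the held-out perplexity diverges from the training perplexity as `k` grows.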

python lda topic-modeling scikit-learn perplexity

Bas*_*Bas

2019-07-02

5
votes

0
answers

1519
views

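Is there a specific range for good perplexity values in NLP?

I am fine-tuning a language model and computing the training and validation loss along with the training and validation perplexity. In my program, perplexity is computed as the exponential of the loss. I know that a lower perplexity indicates a better language model, and I would like to know what range of values a good model falls in. Any help is appreciated. Thanks.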
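
As reference points (not thresholds for a "good" model): a model that spreads probability uniformly over a vocabulary of V tokens has perplexity V, and a model that always assigns probability 1 to the correct token has perplexity 1. A small sketch with a hypothetical vocabulary size and evaluation length:

```python
import math


def perplexity(token_probs):
    """exp of the mean negative log-probability of the correct tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


vocab_size = 1000   # hypothetical vocabulary size
n_tokens = 50       # hypothetical evaluation length

uniform = perplexity([1.0 / vocab_size] * n_tokens)  # uninformed baseline
perfect = perplexity([1.0] * n_tokens)               # always-correct model

print(f"uniform baseline: {uniform:.1f}")
print(f"perfect model:    {perfect:.1f}")
```

Values in between depend on the tokenization, vocabulary, and dataset, so perplexities are only comparable under the same setup.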

nlp neural-network deep-learning language-model perplexity

D.P*_*era

2020-06-24

5
votes

0
answers

552
views

How do I compute perplexity using KenLM?

Suppose we build a model on this:

$ wget https://gist.githubusercontent.com/alvations/1c1b388456dc3760ffb487ce950712ac/raw/86cdf7de279a2b9bceeb3adb481e42691d12fbba/something.txt
$ lmplz -o 5 < something.txt > something.arpa

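From the perplexity formula (https://web.stanford.edu/class/cs124/lec/languagemodeling.pdf)

Summing the inverse-log values to get the inner term and then taking the n-th root, the perplexity comes out unusually small: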

>>> import kenlm
>>> m = kenlm.Model('something.arpa')

# Sentence seen in data.
>>> s = 'The development of a forward-looking and comprehensive European migration policy,'
>>> list(m.full_scores(s))
[(-0.8502398729324341, 2, False), (-3.0185394287109375, 3, False), (-0.3004383146762848, 4, False), (-1.0249041318893433, 5, False), (-0.6545327305793762, 5, False), (-0.29304179549217224, 5, False), (-0.4497605562210083, 5, False), (-0.49850910902023315, 5, False), (-0.3856896460056305, 5, False), (-0.3572353720664978, 5, False), (-1.7523181438446045, 1, False)]
>>> n = len(s.split()) …
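
A sketch of the computation using the scores printed above, without calling KenLM. It assumes that the first element of each tuple is a base-10 log probability and that the final score belongs to the end-of-sentence token, so the normalizer is the number of scores rather than `len(s.split())`. Both points are assumptions about the setup shown here.

```python
# log10 probabilities copied from the full_scores output above.
log10_probs = [
    -0.8502398729324341, -3.0185394287109375, -0.3004383146762848,
    -1.0249041318893433, -0.6545327305793762, -0.29304179549217224,
    -0.4497605562210083, -0.49850910902023315, -0.3856896460056305,
    -0.3572353720664978, -1.7523181438446045,
]

total_log10 = sum(log10_probs)
n = len(log10_probs)  # 10 words plus the assumed end-of-sentence token

# Perplexity is the inverse geometric mean of the token probabilities.
perplexity = 10.0 ** (-total_log10 / n)
print(perplexity)
```

Working in log space avoids multiplying many small probabilities together, and the base of the exponent has to match the base of the logs (10 here, not e or 2).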

python nlp language-model kenlm perplexity

alv*_*vas

lucky-day

3
votes

1
answer

2748
views