Filtering tokens by frequency using filter_extremes in Gensim

Question

Jan*_*lly 3 python text-processing dictionary corpus gensim

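I am trying to filter out tokens by their frequency using the filter_extremes function in Gensim (https://radimrehurek.com/gensim/corpora/dictionary.html). Specifically, I am interested in filtering out words that occur in "fewer than no_below documents" and in "more than no_above documents".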

from gensim import corpora

# texts: a list of tokenized documents (each a list of str)
id2word_ = corpora.Dictionary(texts)
print(len(id2word_))
id2word_.filter_extremes(no_above = 0.600)
print(len(id2word_))


The first print statement gives 11918 and the second gives 3567. However, if I do the following:

id2word_ = corpora.Dictionary(texts)
print(len(id2word_))
id2word_.filter_extremes(no_below = 0.599)
print(len(id2word_))


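The first print statement gives 11918 (as expected) and the second gives 11406. Shouldn't id2word_.filter_extremes(no_below = 0.599) and id2word_.filter_extremes(no_above = 0.600) add up to the total number of words? However, 11406 + 3567 > 11918, so how can the sum exceed the number of words in the corpus? This makes no sense to me, because according to the documentation the two filters should cover non-overlapping sets of words.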

If you have any ideas, I would really appreciate your input. Thanks!

Answer 1

小智 5

By definition:


no_below (int, optional) – Keep tokens which are contained in at least no_below documents.

no_above (float, optional) – Keep tokens which are contained in no more than no_above documents (fraction of total corpus size, not an absolute number).


no_below is an int: an absolute threshold on the number of documents a token appears in. For example, no_below=10 filters out words that appear in fewer than 10 documents.


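In contrast, no_above is a float rather than an int, and it represents a fraction of the total corpus size. For example, no_above=0.1 filters out words that appear in more than 10% of all documents.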
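
To make the difference in units concrete, here is a minimal sketch of the two thresholds in plain Python. It follows the documented rule quoted above, not Gensim's actual source; the helper name filter_by_doc_freq and the toy corpus are made up for illustration.

```python
from collections import Counter


def filter_by_doc_freq(texts, no_below, no_above):
    """Hypothetical helper: return the tokens whose document frequency passes both thresholds."""
    num_docs = len(texts)
    # Document frequency: number of documents that contain the token at least once.
    doc_freq = Counter(token for doc in texts for token in set(doc))
    # no_above is a fraction, so convert it into a document count first.
    max_docs = no_above * num_docs
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}


# Toy corpus (assumption): 4 tokenized documents.
texts = [
    ["apple", "banana", "cherry"],
    ["apple", "banana"],
    ["apple", "date"],
    ["apple", "banana", "elder"],
]

# no_below=2 is an absolute count: drop tokens found in fewer than 2 documents.
frequent = filter_by_doc_freq(texts, no_below=2, no_above=1.0)
# no_above=0.75 is a fraction: drop tokens found in more than 75% of the 4 documents.
not_too_common = filter_by_doc_freq(texts, no_below=1, no_above=0.75)

print(sorted(frequent))        # ['apple', 'banana']
print(sorted(not_too_common))  # ['banana', 'cherry', 'date', 'elder']
```

Note that a fractional value such as 0.599 passed as no_below is still compared against a document count, so by itself it removes nothing.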


It is a bit odd that no_below and no_above are not expressed in the same units, which is what causes the confusion.


Hope this answers your question.

