Does gensim.corpora.Dictionary have term frequency saved?

Question

alv*_*vas 6 python dictionary frequency tf-idf gensim

Does gensim.corpora.Dictionary store term frequencies?

From gensim.corpora.Dictionary, it is possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):

from nltk.corpus import brown
from gensim.corpora import Dictionary

documents = brown.sents()
brown_dict = Dictionary(documents)

# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')


[out]:

The word "these" appears in 1213 documents
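To make the distinction concrete, here is a minimal plain-Python sketch of document frequency versus term frequency. The `docs` list is a toy corpus made up for illustration, and the code does not use gensim at all; it only shows what the two counts mean.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["a", "bird", "saw", "a", "a", "a", "a"],
]

# Term frequency: every occurrence of a token counts.
term_freq = Counter(token for doc in docs for token in doc)

# Document frequency: a token counts at most once per document.
doc_freq = Counter(token for doc in docs for token in set(doc))

for token in ("the", "a"):
    print(token, "term frequency:", term_freq[token],
          "document frequency:", doc_freq[token])
```

In this toy corpus "a" has the highest term frequency while "the" has the highest document frequency, so "most frequent" picks a different token depending on which count is used.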


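There is also a filter_n_most_frequent(remove_n) function that removes the n most frequent tokens:

filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?

If it is the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?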

Answer 1

uba*_*dub 7

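No, gensim.corpora.Dictionary does not save term frequency. You can check the source code to confirm this. The class only stores the following member variables: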

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix


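This means that everything in the class defines frequency as document frequency, never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.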
