Does gensim.corpora.Dictionary save term frequency?

alv*_*vas 6 python dictionary frequency tf-idf gensim

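Does gensim.corpora.Dictionary have term frequency saved?

From gensim.corpora.Dictionary, it's possible to get the document frequency of a word (i.e. how many documents a particular word occurs in):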

from nltk.corpus import brown
from gensim.corpora import Dictionary

documents = brown.sents()
brown_dict = Dictionary(documents)

# The 100th word in the dictionary: 'these'
print('The word "' + brown_dict[100] + '" appears in', brown_dict.dfs[100],'documents')

[out]:

The word "these" appears in 1213 documents

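There is also the filter_n_most_frequent(remove_n) function, which removes the n most frequent tokens:

filter_n_most_frequent(remove_n) Filter out the 'remove_n' most frequent tokens that appear in the documents.

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

Does filter_n_most_frequent remove the n most frequent tokens based on document frequency or on term frequency?

If it's the latter, is there some way to access the term frequency of the words in the gensim.corpora.Dictionary object?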

uba*_*dub 7

No, gensim.corpora.Dictionary does not save term frequency. You can check this in the source code. The class only stores the following member variables:

    self.token2id = {}  # token -> tokenId
    self.id2token = {}  # reverse mapping for token2id; only formed on request, to save memory
    self.dfs = {}  # document frequencies: tokenId -> in how many documents this token appeared

    self.num_docs = 0  # number of documents processed
    self.num_pos = 0  # total number of corpus positions
    self.num_nnz = 0  # total number of non-zeroes in the BOW matrix
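To make the difference between the two counts concrete, here is a minimal sketch in plain Python (standard library only, not gensim). The corpus and the names `docs`, `doc_freq`, and `term_freq` are made up for illustration; `doc_freq` corresponds to what `dfs` records, except that it is keyed by token rather than by token id.

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (hypothetical data).
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked"],
    ["a", "cat", "and", "a", "cat"],
]

# Document frequency: count each token at most once per document.
doc_freq = Counter(token for doc in docs for token in set(doc))

# Term (collection) frequency: count every single occurrence.
term_freq = Counter(token for doc in docs for token in doc)

print(doc_freq["the"], term_freq["the"])  # 2 3
print(doc_freq["a"], term_freq["a"])      # 1 2
```

If you need term frequencies alongside a Dictionary, one option is to keep a separate counter like `term_freq` while iterating over the same documents.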

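This means that everything in the class defines frequency as document frequency, never term frequency, because the latter is never stored globally. This applies to filter_n_most_frequent(remove_n) as well as to every other method.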
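The practical consequence can be sketched in plain Python. This is not gensim's implementation: the helper `most_frequent_by_df` and the toy corpus are hypothetical, and the alphabetical tie-break is a choice made for this sketch only. It shows that ranking by document frequency can spare a token that has the highest raw count.

```python
from collections import Counter

def most_frequent_by_df(docs, remove_n):
    """Return the remove_n tokens that occur in the most documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    # Sort by descending document frequency; break ties alphabetically
    # so the result is deterministic (an assumption of this sketch).
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return ranked[:remove_n]

# "spam" occurs 4 times but in only 1 document;
# "eggs" occurs 3 times but in all 3 documents.
docs = [
    ["spam", "spam", "spam", "spam", "eggs"],
    ["eggs", "ham"],
    ["eggs", "toast"],
]

to_drop = set(most_frequent_by_df(docs, remove_n=1))
filtered = [[t for t in doc if t not in to_drop] for doc in docs]

print(to_drop)   # {'eggs'}
print(filtered)  # [['spam', 'spam', 'spam', 'spam'], ['ham'], ['toast']]
```

Here "eggs" is dropped and "spam" survives, even though "spam" has the larger term frequency.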