Why does the tf-idf model in `gensim` throw away terms and counts after transforming the corpus?

alv*_*vas 2 python nlp information-retrieval tf-idf gensim

Why does the gensim tf-idf model throw away terms and counts after I transform my corpus?

My code:

from gensim import corpora, models, similarities

# Let's say you have a corpus made up of 4 documents,
# each a list of (term_id, count) pairs.
doc0 = [(0, 1), (1, 1)]
doc1 = [(0, 1)]
doc2 = [(0, 1), (1, 1)]
doc3 = [(0, 3), (1, 1)]

corpus = [doc0, doc1, doc2, doc3]

# Train a tfidf model using the corpus
tfidf = models.TfidfModel(corpus)

# Now if you print the corpus, it still holds the raw frequency counts.
for d in corpus:
    print(d)
print()

# To convert the corpus into tf-idf, apply the trained model to it
# to get the normalized weights.
corpus = tfidf[corpus]

for d in corpus:
    print(d)

Output:

[(0, 1.0), (1, 1.0)]
[(0, 1.0)]
[(0, 1.0), (1, 1.0)]
[(0, 3.0), (1, 1.0)]

[(1, 1.0)]
[]
[(1, 1.0)]
[(1, 1.0)]

小智 6

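IDF is obtained by dividing the total number of documents by the number of documents that contain the term, and then taking the logarithm of that quotient. In your case, every document contains term0, so the IDF of term0 is log(1), which equals 0. As a result, the term0 column of the doc-term matrix is all zeros.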

A term that appears in every document gets zero weight; it carries no information at all.
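To make the arithmetic concrete, here is a minimal pure-Python sketch that reproduces the output from the question without gensim. It assumes a base-2 logarithm, L2 normalization of each document, and dropping of (near-)zero weights, which is how I understand gensim's defaults to behave; the helper names `document_frequencies`, `idf` and `tfidf` are made up for illustration and are not gensim API.

```python
import math

# Same toy corpus as in the question: lists of (term_id, count) pairs.
corpus = [
    [(0, 1), (1, 1)],
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(0, 3), (1, 1)],
]


def document_frequencies(docs):
    """Count in how many documents each term id occurs."""
    df = {}
    for doc in docs:
        for term_id, _count in doc:
            df[term_id] = df.get(term_id, 0) + 1
    return df


def idf(total_docs, doc_freq):
    """log2(N / df): exactly 0 when the term is in every document.
    (Assumption: base 2, no smoothing.)"""
    return math.log2(total_docs / doc_freq)


def tfidf(doc, idfs, eps=1e-12):
    """Weight counts by IDF, drop near-zero weights, then L2-normalize."""
    weighted = [(t, count * idfs[t]) for t, count in doc]
    weighted = [(t, w) for t, w in weighted if abs(w) > eps]
    norm = math.sqrt(sum(w * w for _t, w in weighted))
    if norm == 0:
        return []
    return [(t, w / norm) for t, w in weighted]


df = document_frequencies(corpus)
idfs = {t: idf(len(corpus), n) for t, n in df.items()}
transformed = [tfidf(doc, idfs) for doc in corpus]

for doc in transformed:
    print(doc)
```

term0 occurs in all 4 documents, so its IDF is log2(4/4) = 0 and it vanishes from every document, leaving doc1 empty. term1 occurs in 3 of 4 documents, so it keeps a positive weight; because it is the only surviving term in each document, normalization scales it to 1.0, which also explains why the count of 3 for term0 in doc3 leaves no trace.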