Posts by use*_*721

Getting document names in a scikit-learn tf-idf matrix

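I have created a tf-idf matrix, and now I want to retrieve the top 2 words for each document. I want to pass in a document ID and get back the top 2 words for that document.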

Right now I have this sample data:

from sklearn.feature_extraction.text import TfidfVectorizer

d = {'doc1':"this is the first document",'doc2':"it is a sunny day"} ### corpus

test_v = TfidfVectorizer(min_df=1)    ### applied the model
t = test_v.fit_transform(d.values())
feature_names = test_v.get_feature_names() ### list of words/terms (scikit-learn >= 1.2 only has get_feature_names_out())

>>> feature_names
['day', 'document', 'first', 'is', 'it', 'sunny', 'the', 'this']

>>> t.toarray()
array([[ 0.        ,  0.47107781,  0.47107781,  0.33517574,  0.        ,
     0.        ,  0.47107781,  0.47107781],
   [ 0.53404633,  0.        ,  0.        ,  0.37997836,  0.53404633,
     0.53404633,  0.        ,  0.        ]])

I can access the matrix by row and column number, e.g.

 >>> t[0,1]
   0.47107781233161794

Is there a way to access this matrix by document ID instead? In my case the IDs are 'doc1' and 'doc2'.

Thanks
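For what it's worth, here is a minimal sketch of one way the lookup asked about above might work. It is not an answer taken from this thread. It assumes the sparse matrix itself has no notion of document IDs, so the mapping from ID to row number is kept in a separate dict. The names `doc_ids`, `row_of` and `top_terms` are made up for illustration, and the row order relies on Python 3.7+ dicts preserving insertion order.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

d = {'doc1': "this is the first document", 'doc2': "it is a sunny day"}

# Fix the document order once, so that row i of the matrix belongs to doc_ids[i].
doc_ids = list(d.keys())
vectorizer = TfidfVectorizer(min_df=1)
tfidf = vectorizer.fit_transform([d[doc_id] for doc_id in doc_ids])

# Newer scikit-learn releases expose get_feature_names_out();
# older ones only have get_feature_names().
if hasattr(vectorizer, "get_feature_names_out"):
    feature_names = list(vectorizer.get_feature_names_out())
else:
    feature_names = vectorizer.get_feature_names()

# Hypothetical helper mapping: document ID -> row number in the tf-idf matrix.
row_of = {doc_id: i for i, doc_id in enumerate(doc_ids)}

def top_terms(doc_id, n=2):
    """Return up to n (term, score) pairs with the highest tf-idf score in one document."""
    row = tfidf[row_of[doc_id]].toarray().ravel()
    best = np.argsort(row)[::-1][:n]  # column indices, highest score first
    # Skip zero scores so terms absent from the document are never returned.
    return [(str(feature_names[j]), float(row[j])) for j in best if row[j] > 0]

print(top_terms('doc1'))
print(top_terms('doc2'))
```

With the sample data, several terms in each document share the same highest score, so which two of them come back depends on how `argsort` orders ties. A single cell can still be read by ID with something like `tfidf[row_of['doc1'], feature_names.index('document')]`.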

python machine-learning matrix tf-idf scikit-learn
