Calculating TF-IDF for n-grams in Python using sklearn

7 python nlp tf-idf scikit-learn

I have a vocabulary that contains n-grams, as shown below.

myvocabulary = ['tim tam', 'jam', 'fresh milk', 'chocolates', 'biscuit pudding']

I want to calculate TF-IDF values for these terms.

I also have a corpus dictionary as follows (key = recipe number, value = recipe).

corpus = {1: "making chocolates biscuit pudding easy first get your favourite biscuit chocolates", 2: "tim tam drink new recipe that yummy and tasty more thicker than typical milkshake that uses normal chocolates", 3: "making chocolates drink different way using fresh milk egg"}

I am currently using the following code.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())

Now I print the tokens (n-grams) of recipe 1 in corpus together with their TF-IDF values, as follows.

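# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (added in 1.0) replaces it
feature_names = tfidf.get_feature_names_out()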
doc = 0
feature_index = tfs[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfs[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
  print(w, s)

The result I get is chocolates 1.0. However, my code does not detect n-grams (bigrams) such as biscuit pudding when calculating the TF-IDF values. Please let me know where my code goes wrong.

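I want to get the TF-IDF matrix of the myvocabulary terms using the recipe documents in corpus. In other words, the rows of the matrix represent myvocabulary, and the columns represent the recipe documents in my corpus. Please help me.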

σηγ*_*σηγ 15

Try increasing the ngram_range in TfidfVectorizer:

tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english', ngram_range=(1,2))
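To see why ngram_range matters: by default TfidfVectorizer uses ngram_range=(1, 1), so only single words are generated from each document and a two-word vocabulary entry such as 'biscuit pudding' can never match. The following is a simplified pure-Python sketch of word n-gram generation, not scikit-learn's actual implementation; the helper name word_ngrams is hypothetical, and it splits on whitespace rather than using the vectorizer's token pattern or stop-word filter.

```python
def word_ngrams(text, ngram_range=(1, 1)):
    """Return all word n-grams of `text` for the inclusive range (min_n, max_n)."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

recipe = "making chocolates biscuit pudding easy"

# Unigrams only, which mirrors the default ngram_range=(1, 1)
unigrams = word_ngrams(recipe, (1, 1))
# Unigrams plus bigrams, which mirrors ngram_range=(1, 2)
uni_and_bigrams = word_ngrams(recipe, (1, 2))

print("biscuit pudding" in unigrams)         # False
print("biscuit pudding" in uni_and_bigrams)  # True
```

With (1, 2) both the single-word entries ('jam', 'chocolates') and the two-word entries ('tim tam', 'fresh milk', 'biscuit pudding') of the vocabulary can be matched.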

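Edit: The output of TfidfVectorizer is the TF-IDF matrix in sparse format (or rather the transpose of the layout you are after: its rows are documents and its columns are vocabulary terms). You can print its contents like this: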

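# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (added in 1.0) replaces it
feature_names = tfidf.get_feature_names_out()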
corpus_index = [n for n in corpus]
rows, cols = tfs.nonzero()
for row, col in zip(rows, cols):
    print((feature_names[col], corpus_index[row]), tfs[row, col])

which should yield

('biscuit pudding', 1) 0.646128915046
('chocolates', 1) 0.763228291628
('chocolates', 2) 0.508542320378
('tim tam', 2) 0.861036995944
('chocolates', 3) 0.508542320378
('fresh milk', 3) 0.861036995944
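These numbers can be checked by hand. The sketch below assumes TfidfVectorizer's default settings (smooth_idf=True, sublinear_tf=False, norm='l2'), under which idf = ln((1 + n) / (1 + df)) + 1 and each document row is scaled to unit L2 length. It reproduces the two scores for recipe 1, where 'chocolates' occurs twice and 'biscuit pudding' once; the variable names are illustrative only.

```python
import math

n_docs = 3  # number of recipes in the corpus

# Document frequencies: 'chocolates' appears in all 3 recipes,
# 'biscuit pudding' only in recipe 1.
idf_chocolates = math.log((1 + n_docs) / (1 + 3)) + 1  # = 1.0
idf_bigram = math.log((1 + n_docs) / (1 + 1)) + 1      # = ln(2) + 1

# Raw tf * idf for recipe 1: 'chocolates' x2, 'biscuit pudding' x1
raw = [2 * idf_chocolates, 1 * idf_bigram]

# L2-normalise the row so that its Euclidean length is 1
norm = math.sqrt(sum(v * v for v in raw))
score_chocolates, score_bigram = (v / norm for v in raw)

print(round(score_chocolates, 6))  # 0.763228
print(round(score_bigram, 6))      # 0.646129
```

The same calculation with one occurrence of each term gives 0.508542 and 0.861037, the values shown for recipes 2 and 3.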

If the matrix is not large, it may be easier to inspect it in dense form. Pandas makes this very convenient:

import pandas as pd
df = pd.DataFrame(tfs.T.todense(), index=feature_names, columns=corpus_index)
print(df)

This results in

                        1         2         3
tim tam          0.000000  0.861037  0.000000
jam              0.000000  0.000000  0.000000
fresh milk       0.000000  0.000000  0.861037
chocolates       0.763228  0.508542  0.508542
biscuit pudding  0.646129  0.000000  0.000000