Tag: tf-idf

AttributeError: getfeature_names not found; using scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer = vectorizer.fit(word_data)
freq_term_mat = vectorizer.transform(word_data)

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(norm="l2")
tfidf = tfidf.fit(freq_term_mat)
Ttf_idf_matrix = tfidf.transform(freq_term_mat)

voc_words = Ttf_idf_matrix.getfeature_names()
print "The num of words = ",len(voc_words)


When I run the program containing this code, I get the following error:

Traceback (most recent call last):
  File "vectorize_text.py", line 87, in <module>
    voc_words = Ttf_idf_matrix.getfeature_names()
  File "/home/farheen/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 499, in __getattr__
    raise AttributeError(attr + " not found")
AttributeError: get_feature_names not found

Please suggest a solution.
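A possible reading of the traceback: the method is being looked up on the SciPy sparse matrix returned by `transform`, and that matrix carries no vocabulary. The vocabulary lives on the fitted `CountVectorizer`. The sketch below assumes Python 3 and uses a hypothetical two-document `word_data`; it reads the terms from the vectorizer's `vocabulary_` attribute so that it does not depend on which feature-name method the installed scikit-learn version offers (`get_feature_names()` on older releases, `get_feature_names_out()` from 1.0 on).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-in for the asker's word_data (a list of document strings).
word_data = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer().fit(word_data)
freq_term_mat = vectorizer.transform(word_data)

tfidf = TfidfTransformer(norm="l2").fit(freq_term_mat)
tf_idf_matrix = tfidf.transform(freq_term_mat)  # SciPy sparse matrix, no vocabulary attached

# The vocabulary belongs to the fitted vectorizer, not to the matrix.
# vocabulary_ maps each term to its column index; sort the terms by that index.
voc_words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print("The num of words =", len(voc_words))
```

The number of terms matches the number of columns in the tf-idf matrix, which is a quick way to check that the right object was queried.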

python tf-idf scikit-learn

Far*_*fer

2015-07-26

Score: 2 | Answers: 1 | Views: 8614

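TfidfVectorizer - L2 normalized vectors

I want to make sure that the TfidfVectorizer object returns L2-normalized vectors. I am working on a binary classification problem with documents of varying length.

I am trying to extract the normalized vector of each corpus, so I assumed I could simply sum each row of the TfidfVectorizer matrix. However, the sums are greater than 1, and I thought a normalized corpus would bring every document into the range 0 to 1.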

vect = TfidfVectorizer(strip_accents='unicode', stop_words=stopwords, analyzer='word',
                       use_idf=True, tokenizer=tokenizer, ngram_range=(1, 2),
                       sublinear_tf=True, norm='l2')

tfidf = vect.fit_transform(X_train)
# sum norm l2 documents
vect_sum = tfidf.sum(axis=1)


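The values in vect_sum are greater than 1, although I thought using the norm would put all vectors between 0 and 1. I have just learned about a preprocessing object in scikit-learn, preprocessing.Normalizer. Is that something I should use in a GridSearch pipeline? See the example below.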

pipeline = Pipeline([
    ('plb', normalize(tfidf, norm='l2')), #<-- sklearn.preprocessing
    ('tfidf', tfidf_vectorizer),
    ('clf', MultinomialNB()),  
])


What is the difference between preprocessing.Normalizer and the norm parameter of TfidfVectorizer?
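One way to check this on a small case: with `norm='l2'` each row is scaled so that the sum of its squared entries is 1, which is not the same as the plain row sum being 1. The sketch below uses a hypothetical three-document corpus and default settings otherwise; it compares the row sums with the row L2 norms and also applies `sklearn.preprocessing.Normalizer` on top to see whether anything changes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import Normalizer

# Hypothetical corpus with documents of different lengths.
docs = [
    "short text",
    "a much longer text with many more different words in it",
    "another text",
]

tfidf = TfidfVectorizer(norm="l2").fit_transform(docs)

# Plain row sums (what tfidf.sum(axis=1) computes): not constrained to 1.
row_sums = np.asarray(tfidf.sum(axis=1)).ravel()

# Row L2 norms: square root of the sum of squared entries per row.
row_l2_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Applying an L2 Normalizer to rows that already have unit L2 norm.
renormalized = Normalizer(norm="l2").fit_transform(tfidf)

print(row_sums)
print(row_l2_norms)
```

In this sketch the L2 norms come out as 1 while the row sums do not, and the extra `Normalizer` step leaves the matrix unchanged, which suggests it would be redundant after a vectorizer that already uses `norm='l2'`.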

python normalization tf-idf scikit-learn

OAK*_*OAK


Score: 2 | Answers: 1 | Views: 2781

Python, sklearn, tf-idf: how to split on "####" instead of the default whitespace

With sklearn tf-idf, the text is split on whitespace by default:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)


However, I want to use this form:

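from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    'This####is####the####first####document.',
    'This####is####the####second####second####document.',
]
vectorizer = CountVectorizer()
transformer = TfidfTransformer()  # assumed; not defined in the original snippet
X = vectorizer.fit_transform(corpus)
tfidf = transformer.fit_transform(X)
# get_feature_names() was replaced by get_feature_names_out() in scikit-learn 1.0+
word = vectorizer.get_feature_names()
weight = tfidf.toarray()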


How can I do this?
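One option that might fit is the `tokenizer` parameter of `CountVectorizer`, which accepts any callable that turns a document string into a list of tokens. The sketch below is a minimal illustration under that assumption; the helper name `split_on_hashes` is made up for this example, and the column names are read from `vocabulary_` to avoid depending on a specific scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "This####is####the####first####document.",
    "This####is####the####second####second####document.",
]

def split_on_hashes(text):
    """Split a document on the literal '####' delimiter (hypothetical helper)."""
    return text.split("####")

# token_pattern=None signals that the default regex is intentionally unused.
vectorizer = CountVectorizer(tokenizer=split_on_hashes, token_pattern=None)
counts = vectorizer.fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)

# Column order follows vocabulary_ (term -> column index).
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
weights = tfidf.toarray()
print(words)
```

Note that with a plain `split`, punctuation stays attached to the token (for example `document.`), and text is still lowercased before tokenizing because `lowercase=True` is the default.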

python split tf-idf scikit-learn

Yao*_*ian

2018-06-25

Score: 2 | Answers: 1 | Views: 1101

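Inverse-transforming word count vectors back to the original document

I am training a simple text classification model (currently with scikit-learn). To convert my document samples into word count vectors with my own vocabulary, I use

CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)

from sklearn.feature_extraction.text.

This works well, and I can then train my classifier using these word count vectors as feature vectors. What I do not know is how to inverse-transform these word count vectors back into the original documents. CountVectorizer does have a function inverse_transform(X), but it only returns the unique non-zero tokens.

As far as I know, CountVectorizer has no implementation for mapping back to the original document.

Does anyone know how to recover the original sequence of tokens from the count-vectorized representation? Is there a TensorFlow module or any other module that can do this?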
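A small experiment may help show why this is hard: a count vector records how often each term occurs but not where, so two documents with the same words in a different order map to the same vector. The sketch below uses two hypothetical documents to illustrate this.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents: same words, different order.
docs = ["dog bites man", "man bites dog"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# Both documents produce exactly the same count vector.
dense = X.toarray()
same_vector = bool((dense[0] == dense[1]).all())

# inverse_transform only reports which terms are present in each row.
recovered = [sorted(tokens) for tokens in vectorizer.inverse_transform(X)]
print(same_vector, recovered)
```

Since both rows are identical, no function of the count vector alone can tell the two documents apart; keeping the tokenized documents alongside the vectors is one way to retain the original sequence.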

nlp tf-idf scikit-learn tensorflow countvectorizer

ant*_*hka


Score: 2 | Answers: 1 | Views: 2646

What is the difference between the tfidf vectorizer and the tfidf transformer?

I know the formula for the tfidf vectorizer is

Count of word / Total count * log(Number of documents / Number of documents where word is present)


I saw a tfidf transformer in scikit-learn and just wanted to tell the two apart. I could not find anything helpful.
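As a rough way to see the relationship: `TfidfTransformer` takes a matrix of term counts as input, while `TfidfVectorizer` takes raw text and does the counting itself. The sketch below, on a small sample corpus with default parameters, checks whether `CountVectorizer` followed by `TfidfTransformer` gives the same matrix as `TfidfVectorizer` alone.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    "This is the first document.",
    "This is the second second document.",
    "And the third one.",
    "Is this the first document?",
]

# Two steps: raw text -> term counts -> tf-idf weights.
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One step: raw text -> tf-idf weights.
one_step = TfidfVectorizer().fit_transform(corpus)

equal = bool(np.allclose(two_step.toarray(), one_step.toarray()))
print(equal)
```

With default settings scikit-learn also departs slightly from the textbook formula quoted above: the term frequency is the raw count, the idf is smoothed, and each row is L2-normalized afterwards.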

python nltk tf-idf scikit-learn tfidfvectorizer

use*_*396

2019-02-18

Score: 2 | Answers: 3 | Views: 6138

How to get the top n terms with the highest tf-idf score - large sparse matrix

There is this code:

feature_array = np.array(tfidf.get_feature_names())
tfidf_sorting = np.argsort(response.toarray()).flatten()[::-1]

n = 3
top_n = feature_array[tfidf_sorting][:n]


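from this answer.

My question is: how can I do this efficiently when my sparse matrix is too large to convert to a dense matrix in one go (with response.toarray())?

Apparently, the general answer is to split the sparse matrix into chunks, convert each chunk inside a for loop, and then combine the results from all chunks.

But I would like to see concrete code that does this.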
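One possible approach, sketched here rather than taken from the linked answer, is to skip densifying altogether and work on the CSR arrays directly: for each row, `indptr` marks the slice of `data` (scores) and `indices` (column numbers) that holds its non-zero entries. The function name `top_n_terms_per_row` and the tiny corpus are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_n_terms_per_row(matrix, feature_names, n=3):
    """For each row of a sparse matrix, return up to n terms with the highest score."""
    matrix = matrix.tocsr()
    results = []
    for i in range(matrix.shape[0]):
        start, end = matrix.indptr[i], matrix.indptr[i + 1]
        scores = matrix.data[start:end]   # non-zero scores of row i
        cols = matrix.indices[start:end]  # their column indices
        order = np.argsort(scores)[::-1][:n]  # positions of the largest scores first
        results.append([feature_names[j] for j in cols[order]])
    return results

# Hypothetical corpus, small enough to check by hand.
corpus = ["apple apple apple banana", "banana cherry"]
vectorizer = TfidfVectorizer()
response = vectorizer.fit_transform(corpus)

# Terms ordered by column index, read from vocabulary_ (term -> column index).
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

top_terms = top_n_terms_per_row(response, feature_names, n=2)
print(top_terms)
```

Only the non-zero entries of one row are sorted at a time, so memory use stays proportional to the longest row rather than to the full vocabulary times the number of documents.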

python tf-idf python-3.x scikit-learn tfidfvectorizer

Poe*_*dit

2019-06-22

Score: 2 | Answers: 1 | Views: 1727

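How to use tfidf values in k-means clustering

I am using K-means clustering together with TF-IDF via the scikit-learn library. I understand that K-means uses distances to form clusters, and that the distance is expressed in terms of (x-axis value, y-axis value), but tf-idf is a single numeric value. My question is how this tf-idf value is converted into an (x, y) value by K-means clustering.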
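A sketch that may clarify the geometry: each document is not a single tf-idf number but a whole vector with one tf-idf value per vocabulary term, so K-means measures distances in a space with as many axes as there are terms rather than in an (x, y) plane. The four documents below are hypothetical, and `n_clusters=2`, `n_init=10`, and `random_state=0` are arbitrary choices for the illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents on two unrelated topics.
docs = [
    "cat dog pet",
    "cat dog pet animal",
    "stock market price",
    "stock market price trade",
]

# Each row is one document; each column is the tf-idf value of one term.
X = TfidfVectorizer().fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(X.shape)                    # (number of documents, number of terms)
print(km.cluster_centers_.shape)  # one centroid per cluster, same number of axes
print(km.labels_)
```

The centroids have the same number of coordinates as the document vectors, which is what allows a distance between a document and a centroid to be computed.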

nlp tf-idf k-means python-3.x tfidfvectorizer

Sid*_*Sid


Score: 2 | Answers: 1 | Views: 2645

Should my model always give 100% accuracy on the training dataset?

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Multinomial Naive Bayes on lemmatized text
X_train, X_test, y_train, y_test = train_test_split(df['Rejoined_Lemmatize'], df['Product'], random_state=0)

# tfidf is the asker's vectorizer (its definition is not shown in the post)
X_train_counts = tfidf.fit_transform(X_train)
clf = MultinomialNB().fit(X_train_counts, y_train)
# Predict on the training data itself
y_temp = clf.predict(tfidf.transform(X_train))


I am testing my model on the training dataset itself. It gives me the following result:

                          precision    recall  f1-score   support

               accuracy                           0.92    742500
              macro avg       0.93      0.92      0.92    742500
           weighted avg       0.93      0.92      0.92    742500


Is an accuracy below 100% on the training dataset acceptable?
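One reason, among possibly several, why training accuracy can stay below 100%: if two training documents have identical text but different labels, any model that maps the same input to the same output must get at least one of them wrong. The sketch below shows this with a hypothetical three-document dataset; the texts and labels are invented for the illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training set: the first two texts are identical but labeled differently.
X_train = ["refund please", "refund please", "card lost"]
y_train = ["billing", "fraud", "fraud"]

tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)

clf = MultinomialNB().fit(X_train_tfidf, y_train)
preds = clf.predict(tfidf.transform(X_train))

train_accuracy = accuracy_score(y_train, preds)
print(train_accuracy)
```

Identical inputs always receive the same prediction, so the conflicting pair caps the achievable training accuracy below 1.0 no matter how the model is tuned.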

python machine-learning tf-idf scikit-learn naivebayes

mri*_*ank


Score: 2 | Answers: 1 | Views: 10k