I implemented tf-idf with sklearn for each category of the Brown corpus from the nltk library. There are 15 categories, and in each of them the highest score goes to a stop word.
The default parameter is use_idf=True, so idf is being used. The corpus is large enough that the scores should come out right. So I don't understand: why are stop words given such high values?
import nltk, sklearn, numpy
import pandas as pd
from nltk.corpus import brown, stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('brown')
nltk.download('stopwords')

# Build one document per Brown category by joining all of its words
corpus = []
for c in brown.categories():
    doc = ' '.join(brown.words(categories=c))
    corpus.append(doc)

thisvectorizer = TfidfVectorizer()
X = thisvectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
features = thisvectorizer.get_feature_names_out()

# Print the three highest-scoring words for each category
for array in tfidf_matrix:
    tfidf_per_doc = list(zip(features, array))
    tfidf_per_doc.sort(key=lambda x: x[1], reverse=True)
    print(tfidf_per_doc[:3])
The result is:
[('the', 0.6893251240111703), ('and', 0.31175508121108203), ('he', 0.24393467757919754)]
[('the', 0.6907757197452503), ('of', 0.4103688069243256), ('and', 0.28727742797362427)]
[('the', 0.7263025975051108), ('of', 0.3656242079748301), ('to', 0.291070574384772)]
[('the', 0.6754696081456901), ('and', 0.31548027033056486), ('to', 0.2688347676067454)]
[('the', 0.6814989142114783), ('of', 0.45275950370682505), ('and', 0.2884682701141856)]
[('the', 0.695577697455948), ('of', 0.35341130124782577), ('and', 0.31967658612871513)]
[('the', 0.6319718467602307), ('and', 0.3252073024670836), ('of', 0.31905971640910474)]
[('the', 0.7201346766200954), ('of', 0.4283480504712354), ('and', 0.2462470090388333)]
[('the', 0.7145625245362096), ('of', 0.3795569321959571), ('and', 0.2911711705971684)]
[('the', 0.6452744438258314), ('to', 0.2965331457609836), ('and', 0.29378534827130653)]
[('the', 0.7507413874270662), ('of', 0.3364825248186412), ('and', 0.25753131787795447)]
[('the', 0.6883038024694869), ('of', 0.41770049303087814), ('and', 0.2675503490244296)]
[('the', 0.6952456562438267), ('of', 0.39285038765440655), ('and', 0.34045082029960866)]
[('the', 0.5816391566950566), ('and', 0.3731049841274644), ('to', 0.2960718382909285)]
[('the', 0.6514884130485116), ('of', 0.29645876610367955), ('to', 0.2766347756651356)]
Every single one is a stop word; in fact, roughly the top 15 words of every category are stop words.
If I use the stop_words parameter with nltk's built-in stop word list, the values are more or less fine. But that makes no sense to me: tf-idf is supposed to downweight them by default, isn't it? Am I making a silly mistake somewhere?
my_stop_words = stopwords.words('english')
thisvectorizer = TfidfVectorizer(stop_words=my_stop_words)
[('said', 0.27925480211869536), ('would', 0.18907877226786665), ('man', 0.18520023334955144)]
[('one', 0.2904582969159082), ('would', 0.1989714323107254), ('new', 0.1394799739062623)]
[('would', 0.2225121466087311), ('one', 0.21533433542780428), ('new', 0.1603044497073654)]
[('would', 0.3015860042740072), ('said', 0.20105733618267146), ('one', 0.19691182409643082)]
[('state', 0.20994145654158766), ('year', 0.16516637619246616), ('fiscal', 0.1627693480477495)]
[('one', 0.27315617167196987), ('new', 0.1339515841852929), ('time', 0.12957408143413954)]
[('said', 0.25253824925464713), ('barco', 0.2297681382507305), ('one', 0.22671047376269457)]
[('af', 0.53260466412674), ('one', 0.2029977500545255), ('may', 0.12401317094240104)]
[('one', 0.29617565661385375), ('time', 0.15556701155475144), ('would', 0.14135656338388475)]
[('said', 0.22644107030344426), ('would', 0.2097909916046616), ('one', 0.1986909391388065)]
[('said', 0.2724277852935244), ('mrs', 0.19471476451838934), ('would', 0.1650670817295739)]
[('god', 0.2540052570261857), ('one', 0.18304020379411245), ('church', 0.17784155752544287)]
[('one', 0.2402151822472666), ('mr', 0.1854602509997279), ('new', 0.16073221753309752)]
[('said', 0.32053197885047946), ('would', 0.23918851593978377), ('could', 0.18980141345828996)]
[('helva', 0.34147320176374735), ('ekstrohm', 0.27116989551827), ('would', 0.2609130084842849)]
Stop words are assigned large values because of a problem with how your corpus is built and, consequently, how the tf-idf is computed.
The shape of your matrix X, (15, 42396), means that you have only 15 documents, and those documents contain 42396 distinct words.
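You can confirm this directly (a quick check, assuming the X from the snippet above):

print(X.shape)  # (15, 42396): one row per Brown category, one column per distinct word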
The mistake is that this snippet joins all the words of a given category into one single document, instead of keeping the documents the category is actually composed of:
for c in brown.categories():
    doc = ' '.join(brown.words(categories=c))
    corpus.append(doc)
You can modify the code to:
for c in brown.categories():
    doc = [" ".join(x) for x in brown.sents(categories=c)]
    corpus.extend(doc)
This creates one corpus entry per sentence, so every sentence becomes its own document. Your X matrix will then have the shape (57340, 42396).
This matters a lot, because stop words now occur in most of the documents, which assigns them very low tf-idf values.
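To see why the number of documents matters so much, here is a minimal sketch of the smoothed idf formula that sklearn applies by default (smooth_idf=True); the sentence-level document frequencies below are illustrative assumptions, not measured values:

import numpy as np

def smoothed_idf(n_docs, doc_freq):
    # sklearn's default idf: ln((1 + n) / (1 + df)) + 1
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# With 15 category-level documents, a word like 'the' appears in
# all of them, so its idf hits the floor of 1.0 and tf alone
# decides the ranking; nothing has a higher tf than 'the'.
print(smoothed_idf(15, 15))        # 1.0
print(smoothed_idf(15, 14))        # ~1.06

# With 57340 sentence-level documents the spread is much larger:
print(smoothed_idf(57340, 35000))  # ~1.49: a stop word in most sentences
print(smoothed_idf(57340, 50))     # ~8.03: a genuinely rare word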
You can inspect the top-scoring words with the following snippet:
import numpy as np

feature_names = thisvectorizer.get_feature_names_out()
# Sort the nonzero tf-idf values and take the largest ones
# (note: this slice actually returns the top 24 entries, not 25)
sorted_nzs = np.argsort(X.data)[:-(25):-1]
feature_names[X.indices[sorted_nzs]]
Output:
array(['customer', 'asked', 'properties', 'itch', 'locked', 'achieving',
'jack', 'guess', 'criticality', 'me', 'sir', 'beckworth', 'visa',
'will', 'casey', 'athletics', 'norms', 'yeah', 'eh', 'oh', 'af',
'currency', 'example', 'movies'], dtype=object)
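As a final sanity check (a minimal sketch, assuming the fitted vectorizer from above; 'helva' is simply one of the rare words visible in the earlier output), you can compare the learned idf weight of a stop word with that of a rare word:

# 'the' should sit near the idf floor of 1.0, while a rare word
# like 'helva' should score far higher.
vocab = thisvectorizer.vocabulary_
for word in ('the', 'helva'):
    if word in vocab:
        print(word, thisvectorizer.idf_[vocab[word]])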