I implemented tf-idf with sklearn for each category of the Brown corpus from the nltk library. There are 15 categories, and in each of them the highest score goes to a stop word.
The default parameter is use_idf=True, so idf is being used. The corpus is large enough that the scores should come out right. So I don't understand: why are stop words given such high values?
import nltk, sklearn, numpy
import pandas as pd
from nltk.corpus import brown, stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('brown')
nltk.download('stopwords')

# Build one document per Brown category by joining all of its words
corpus = []
for c in brown.categories():
    doc = ' '.join(brown.words(categories=c))
    corpus.append(doc)

thisvectorizer = TfidfVectorizer()
X = thisvectorizer.fit_transform(corpus)
tfidf_matrix = X.toarray()
features = thisvectorizer.get_feature_names_out()

# Print the three highest-scoring words for each category
for array in tfidf_matrix:
    tfidf_per_doc = list(zip(features, array))
    tfidf_per_doc.sort(key=lambda x: x[1], reverse=True)
    print(tfidf_per_doc[:3])
The result is:
[('the', 0.6893251240111703), ('and', 0.31175508121108203), ('he', 0.24393467757919754)]
[('the', 0.6907757197452503), ('of', 0.4103688069243256), ('and', 0.28727742797362427)]
[('the', 0.7263025975051108), ('of', 0.3656242079748301), ('to', 0.291070574384772)]
[('the', 0.6754696081456901), ('and', 0.31548027033056486), ('to', 0.2688347676067454)]
[('the', 0.6814989142114783), ('of', 0.45275950370682505), ('and', 0.2884682701141856)]
[('the', 0.695577697455948), ('of', 0.35341130124782577), ('and', 0.31967658612871513)]
[('the', 0.6319718467602307), ('and', 0.3252073024670836), ('of', 0.31905971640910474)]
[('the', 0.7201346766200954), ('of', 0.4283480504712354), ('and', 0.2462470090388333)]
[('the', 0.7145625245362096), ('of', 0.3795569321959571), ('and', 0.2911711705971684)]
[('the', 0.6452744438258314), ('to', 0.2965331457609836), ('and', 0.29378534827130653)]
[('the', 0.7507413874270662), ('of', 0.3364825248186412), ('and', 0.25753131787795447)]
[('the', 0.6883038024694869), ('of', 0.41770049303087814), ('and', 0.2675503490244296)]
[('the', 0.6952456562438267), ('of', 0.39285038765440655), ('and', 0.34045082029960866)]
[('the', 0.5816391566950566), ('and', 0.3731049841274644), ('to', 0.2960718382909285)]
[('the', 0.6514884130485116), ('of', 0.29645876610367955), ('to', 0.2766347756651356)]
Every single one is a stop word; in fact, roughly the top 15 words of every category are stop words.
If I use the stop_words parameter with nltk's built-in stop word list, the values are more or less fine. But that makes no sense to me: tf-idf is supposed to downweight them by default, isn't it? Am I making a silly mistake somewhere?
my_stop_words = stopwords.words('english')
thisvectorizer = TfidfVectorizer(stop_words=my_stop_words)
[('said', 0.27925480211869536), ('would', 0.18907877226786665), ('man', 0.18520023334955144)]
[('one', 0.2904582969159082), ('would', 0.1989714323107254), ('new', 0.1394799739062623)]
[('would', 0.2225121466087311), ('one', 0.21533433542780428), ('new', 0.1603044497073654)]
[('would', 0.3015860042740072), ('said', 0.20105733618267146), ('one', 0.19691182409643082)]
[('state', 0.20994145654158766), ('year', 0.16516637619246616), ('fiscal', 0.1627693480477495)]
[('one', 0.27315617167196987), ('new', 0.1339515841852929), ('time', 0.12957408143413954)]
[('said', 0.25253824925464713), ('barco', 0.2297681382507305), ('one', 0.22671047376269457)]
[('af', 0.53260466412674), ('one', 0.2029977500545255), ('may', 0.12401317094240104)]
[('one', 0.29617565661385375), ('time', 0.15556701155475144), ('would', 0.14135656338388475)]
[('said', 0.22644107030344426), ('would', 0.2097909916046616), ('one', 0.1986909391388065)]
[('said', 0.2724277852935244), ('mrs', 0.19471476451838934), ('would', 0.1650670817295739)]
[('god', 0.2540052570261857), ('one', 0.18304020379411245), ('church', 0.17784155752544287)]
[('one', 0.2402151822472666), ('mr', 0.1854602509997279), ('new', 0.16073221753309752)]
[('said', 0.32053197885047946), ('would', 0.23918851593978377), ('could', 0.18980141345828996)]
[('helva', 0.34147320176374735), ('ekstrohm', 0.27116989551827), ('would', 0.2609130084842849)]
Stop words are assigned large values because of a problem with how your corpus is built and, consequently, how the tf-idf is computed.
The shape of your matrix X, (15, 42396), means that you have only 15 documents, and those documents contain 42396 distinct words.
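You can confirm this directly (a quick check, assuming the X from the snippet above):

print(X.shape)  # (15, 42396): one row per Brown category, one column per distinct word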
The mistake is that this snippet joins all the words of a given category into one single document, instead of keeping the documents the category is actually composed of:
for c in brown.categories():
    doc = ' '.join(brown.words(categories=c))
    corpus.append(doc)
You can modify the code to:
for c in brown.categories():
    doc = [" ".join(x) for x in brown.sents(categories=c)]
    corpus.extend(doc)
This creates one corpus entry per sentence, so every sentence becomes its own document. Your X matrix will then have the shape (57340, 42396).
This matters a lot, because stop words now occur in most of the documents, which assigns them very low tf-idf values.
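To see why the number of documents matters so much, here is a minimal sketch of the smoothed idf formula that sklearn applies by default (smooth_idf=True); the sentence-level document frequencies below are illustrative assumptions, not measured values:

import numpy as np

def smoothed_idf(n_docs, doc_freq):
    # sklearn's default idf: ln((1 + n) / (1 + df)) + 1
    return np.log((1 + n_docs) / (1 + doc_freq)) + 1

# With 15 category-level documents, a word like 'the' appears in
# all of them, so its idf hits the floor of 1.0 and tf alone
# decides the ranking; nothing has a higher tf than 'the'.
print(smoothed_idf(15, 15))        # 1.0
print(smoothed_idf(15, 14))        # ~1.06

# With 57340 sentence-level documents the spread is much larger:
print(smoothed_idf(57340, 35000))  # ~1.49: a stop word in most sentences
print(smoothed_idf(57340, 50))     # ~8.03: a genuinely rare word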
You can inspect the top-scoring words with the following snippet:
import numpy as np

feature_names = thisvectorizer.get_feature_names_out()
# Sort the nonzero tf-idf values and take the largest ones
# (note: this slice actually returns the top 24 entries, not 25)
sorted_nzs = np.argsort(X.data)[:-(25):-1]
feature_names[X.indices[sorted_nzs]]
Output:
array(['customer', 'asked', 'properties', 'itch', 'locked', 'achieving',
'jack', 'guess', 'criticality', 'me', 'sir', 'beckworth', 'visa',
'will', 'casey', 'athletics', 'norms', 'yeah', 'eh', 'oh', 'af',
'currency', 'example', 'movies'], dtype=object)
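As a final sanity check (a minimal sketch, assuming the fitted vectorizer from above; 'helva' is simply one of the rare words visible in the earlier output), you can compare the learned idf weight of a stop word with that of a rare word:

# 'the' should sit near the idf floor of 1.0, while a rare word
# like 'helva' should score far higher.
vocab = thisvectorizer.vocabulary_
for word in ('the', 'helva'):
    if word in vocab:
        print(word, thisvectorizer.idf_[vocab[word]])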