Removing stop words from the most common words in a set of sentences in Python

Question

I have 5 sentences in an np.array A, and I want to find the n most common words. For example, if n=5 I want the 5 most common words. Here is an example:

0    rt my mother be on school amp race
1    rt i am a red hair down and its a great
2    rt my for your every day and my chocolate
3    rt i am that red human being a man
4    rt my mother be on school and wear

Below is the code I use to get the n most common words.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

A = np.array(["rt my mother be on school amp race", 
              "rt i am a red hair down and its a great", 
              "rt my for your every day and my chocolate",
              "rt i am that red human being a man",
              "rt my mother be on school and wear"])

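n = 5
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vocabulary = vectorizer.get_feature_names_out()
# argsort is ascending, so the last n indices hold the n largest counts
ind = np.argsort(X.toarray().sum(axis=0))[-n:]

top_n_words = [vocabulary[a] for a in ind]

print(top_n_words)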

The result is as follows:

['school', 'am', 'and', 'my', 'rt']

However, I want to ignore stop words such as "and", "am", and "my" among these most common words. How can I do this?

Answer 1

Ank*_*nha 5

You just need to pass the parameter stop_words='english' to CountVectorizer():

vectorizer = CountVectorizer(stop_words='english')

You should now get:

['wear', 'mother', 'red', 'school', 'rt']

See the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
