the*_*law 1 python nltk stop-words
I have 5 sentences in an np.array and I want to find the n most common words. For example, if n=5, I want the 5 most common words. Here is an example:
0 rt my mother be on school amp race
1 rt i am a red hair down and its a great
2 rt my for your every day and my chocolate
3 rt i am that red human being a man
4 rt my mother be on school and wear
Below is the code I use to get the n most common words.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
A = np.array(["rt my mother be on school amp race",
"rt i am a red hair down and its a great",
"rt my for your every day and my chocolate",
"rt i am that red human being a man",
"rt my mother be on school and wear"])
n = 5
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(A)
vocabulary = vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2
ind = np.argsort(X.toarray().sum(axis=0))[-n:]
top_n_words = [vocabulary[a] for a in ind]
print(top_n_words)
The result is as follows:
['school', 'am', 'and', 'my', 'rt']
However, I want to ignore stop words such as "and", "am", and "my" among these most common words. How can I do that?
You just need to pass the parameter stop_words='english' to CountVectorizer():
vectorizer = CountVectorizer(stop_words='english')
You should now get:
['wear', 'mother', 'red', 'school', 'rt']