How can I ignore certain words when counting word frequencies in a text?

Question

Kla*_*sos 1 python text python-2.7 pandas scikit-learn

When counting how often each word occurs in a text, how can I ignore words such as "a" and "the"?

import collections

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'phrase': pd.Series('The large distance between cities. The small distance. The')})
f = CountVectorizer().build_tokenizer()(str(df['phrase']))

result = collections.Counter(f).most_common(1)

print(result)


The result is "The", but I would like "distance" to come out as the most frequent word.

Answer 1

lie*_*480 5

It is best to avoid counting those entries in the first place, like this:

ignore = {'the','a','if','in','it','of','or'}
# Lowercase each token before the lookup so that 'The' matches the lowercase entry 'the'
result = collections.Counter(x for x in f if x.lower() not in ignore).most_common(1)

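Two details in the question's code can still get in the way of this filter. The tokenizer returned by build_tokenizer() does not lowercase its output, so "The" has to be lowercased before it can match "the". Also, str(df['phrase']) yields the printed representation of the whole Series (including the "Name: phrase, dtype: object" footer, and long values may be shortened), not just the sentence. Below is a minimal, standard-library-only sketch of the same filtering idea applied to the raw string. The helper name most_common_words and the regular expression are assumptions made for illustration; the pattern is meant to mirror CountVectorizer's default of keeping words with two or more characters.

```python
import collections
import re

# Assumption: a small hand-picked set of words to skip; extend as needed.
IGNORE = {'the', 'a', 'if', 'in', 'it', 'of', 'or'}

# Words of two or more characters, similar to CountVectorizer's default tokens.
TOKEN_RE = re.compile(r'\b\w\w+\b')


def most_common_words(text, ignore=IGNORE, n=1):
    """Return the n most frequent lowercase tokens in text, skipping ignored words."""
    tokens = (t.lower() for t in TOKEN_RE.findall(text))
    counts = collections.Counter(t for t in tokens if t not in ignore)
    return counts.most_common(n)


text = 'The large distance between cities. The small distance. The'
result = most_common_words(text)
print(result)  # [('distance', 2)]
```

With a DataFrame like the one in the question, the sentence itself can be passed in as df['phrase'].iloc[0] instead of str(df['phrase']).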
