dha*_*025 5 javascript analytics stemming data-mining
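Hi, I'm looking for a library that removes stop words from text in JavaScript. My end goal is to compute tf-idf and then convert a given document into a vector space, all in JavaScript. Can anyone point me to a library that would help me do this? Even a library that only removes stop words would be great.
Use the stop words provided by the NLTK library: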
stopwords = ['i','me','my','myself','we','our','ours','ourselves','you','your','yours','yourself','yourselves','he','him','his','himself','she','her','hers','herself','it','its','itself','they','them','their','theirs','themselves','what','which','who','whom','this','that','these','those','am','is','are','was','were','be','been','being','have','has','had','having','do','does','did','doing','a','an','the','and','but','if','or','because','as','until','while','of','at','by','for','with','about','against','between','into','through','during','before','after','above','below','to','from','up','down','in','out','on','off','over','under','again','further','then','once','here','there','when','where','why','how','all','any','both','each','few','more','most','other','some','such','no','nor','not','only','own','same','so','than','too','very','s','t','can','will','just','don','should','now']
Then simply pass your string to the following function:
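function remove_stopwords(str) {
    // Note: the comparison is case-sensitive, so capitalized words such as "I" are kept
    const res = []
    const words = str.split(' ')
    for (let i = 0; i < words.length; i++) {
        // Strip periods so that "me." matches the stop word "me"
        const word_clean = words[i].split(".").join("")
        if (!stopwords.includes(word_clean)) {
            res.push(word_clean)
        }
    }
    return res.join(' ')
}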
Example:
remove_stopwords("I will go to the place where there are things for me.")
Result:
I go place things
Just add any words that are not yet covered to the NLTK array.
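The function above compares words case-sensitively and strips only periods, which is why "I" survives in the result. If that is not what you want, here is a minimal sketch of a variant that lowercases each word before the lookup and trims punctuation from both ends. The names `removeStopwords` and `sampleStopwords` are made up for illustration, and the short sample list stands in for the full NLTK array.

```javascript
// Hypothetical helper, not part of the original answer.
// A short sample list for illustration; in practice, build the Set from the full NLTK array.
const sampleStopwords = new Set(['i', 'me', 'will', 'to', 'the', 'where', 'there', 'are', 'for']);

function removeStopwords(text, stopwordSet = sampleStopwords) {
  return text
    .split(/\s+/)                                  // split on any run of whitespace
    .map(w => w.replace(/^[^\w]+|[^\w]+$/g, ''))   // trim non-word characters at both ends (\w is ASCII-only)
    .filter(w => w.length > 0 && !stopwordSet.has(w.toLowerCase()))
    .join(' ');
}
```

Using a `Set` keeps each lookup constant-time, which may matter once the same list is applied to many documents, for example before computing tf-idf.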
I don't think there is such a library; you need to download the words from https://www.ranks.nl/stopwords.
Then replace the words as follows:
text = text.replace(stopword, "")
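Note that `replace` with a plain string argument removes only the first occurrence and also matches inside longer words (for example "the" inside "then"). One possible way around this, sketched here under the assumption that the downloaded words are available as an array of strings, is to build a single whole-word regular expression. The helper names `escapeRegExp` and `stripStopwords` are hypothetical.

```javascript
// Hypothetical helper: escape regex metacharacters so each stop word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: remove every whole-word, case-insensitive occurrence of the listed words.
function stripStopwords(text, stopwordList) {
  const pattern = new RegExp(
    '\\b(?:' + stopwordList.map(escapeRegExp).join('|') + ')\\b',
    'gi'                                   // g: all occurrences, i: ignore case
  );
  return text
    .replace(pattern, '')                  // drop the stop words
    .replace(/\s+/g, ' ')                  // collapse the gaps they leave behind
    .trim();
}
```

For example, `stripStopwords("The cat sat on the mat", ["the", "on"])` returns `"cat sat mat"`.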