在 Spark 中过滤停用词

Question

在 Spark 中过滤停用词

我试图从.txt文件的 RDD 单词中过滤掉停用词。

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))

// Create a tuple of test words
val testWords = ("you", "to")

// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))

Run Code Online (Sandbox Code Playgroud)

上面的代码对我来说很有意义并且似乎工作正常。这是我遇到麻烦的地方。

//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)

Run Code Online (Sandbox Code Playgroud)

这将运行，但实际上不会过滤掉 tuple 中包含的单词testWords。我不确定如何wordsWithStopWords针对元组中的每个单词测试单词testWords

Answer 1

eli*_*sah 5

您可以使用 Broadcast 变量来过滤您的停用词 RDD ：

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)

// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))

Run Code Online (Sandbox Code Playgroud)

广播变量允许您在每台机器上缓存一个只读变量，而不是随任务一起传送它的副本。例如，它们可用于以有效的方式为每个节点提供大型输入数据集的副本（在这种情况下也是如此）。

归档时间：	9 年，6 月前
查看次数：	10909 次
最近记录：	6 年，2 月前