在 Spark 中过滤停用词

Jak*_*ard 2 scala apache-spark

我试图从.txt文件的 RDD 单词中过滤掉停用词。

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")
val stopWords = stopWordsInput.map(x => x.split(","))

// Create a tuple of test words
val testWords = ("you", "to")

// Split using a regular expression that extracts words
val wordsWithStopWords = input.flatMap(x => x.split("\\W+"))
Run Code Online (Sandbox Code Playgroud)

上面的代码对我来说很有意义并且似乎工作正常。这是我遇到麻烦的地方。

//Remove the stop words from the list
val words = wordsWithStopWords.filter(x => x != testWords)
Run Code Online (Sandbox Code Playgroud)

这将运行,但实际上不会过滤掉 tuple 中包含的单词testWords。我不确定如何wordsWithStopWords针对元组中的每个单词测试单词testWords

eli*_*sah 5

您可以使用 Broadcast 变量来过滤您的停用词 RDD :

// Creating the RDDs
val input = sc.textFile("../book.txt")
val stopWordsInput = sc.textFile("../stopwords.csv")

// Flatten, collect, and broadcast.
val stopWords = stopWordsInput.flatMap(x => x.split(",")).map(_.trim)
val broadcastStopWords = sc.broadcast(stopWords.collect.toSet)

// Split using a regular expression that extracts words
val wordsWithStopWords: RDD[String] = input.flatMap(x => x.split("\\W+"))
wordsWithStopWords.filter(!broadcastStopWords.value.contains(_))
Run Code Online (Sandbox Code Playgroud)

广播变量允许您在每台机器上缓存一个只读变量,而不是随任务一起传送它的副本。例如,它们可用于以有效的方式为每个节点提供大型输入数据集的副本(在这种情况下也是如此)。