How do I count the words in each line of a text file using an RDD?

Kir*_*hat 2 scala apache-spark

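Is there a way to use map and reduce to count word occurrences for each line of an RDD, rather than for the RDD as a whole?

For example, if an RDD[String] contains the following two lines:

Let's have some fun.

To have fun you don't need any plans.

then the output should be something like a map of key-value pairs for each line: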

("Let's", 1)
("have", 1)
("some", 1)
("fun", 1)

("To", 1)
("have", 1)
("fun", 1)
("you", 1)
("don't", 1)
("need", 1)
("plans", 1)

Jac*_*ski 5

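If you have just started with Spark and nobody told you to use the RDD API, don't use it. Spark SQL offers a much nicer and often more efficient API for this and many other distributed computations over large datasets.

Using the RDD API is like using assembler for a task you could do in Scala (or another higher-level programming language). It is certainly too much to think about when you are starting your journey with Spark, and I would personally recommend the higher-level Spark SQL API with DataFrames and Datasets first.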
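For comparison, if the RDD API is a hard requirement, a per-line count could look roughly like the sketch below. It is not the approach recommended here. The object name `WordsPerLineRdd`, the helper `countWords` and the local-mode session are assumptions made for illustration. Each line is mapped to its own `Map` of counts, so the "reduce" happens inside a line rather than across the whole RDD.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

// Hypothetical names throughout; a sketch, not a definitive implementation.
object WordsPerLineRdd {

  // Plain Scala, no Spark: count how often each word occurs within one line.
  def countWords(line: String): Map[String, Int] =
    line.split("\\s+")
      .filter(_.nonEmpty) // a blank line splits into one empty token; drop it
      .foldLeft(Map.empty[String, Int]) { (counts, word) =>
        // the per-line "reduce": bump the running count for this word
        counts.updated(word, counts.getOrElse(word, 0) + 1)
      }

  def main(args: Array[String]): Unit = {
    // Local-mode session is an assumption; in spark-shell use the provided `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("words-per-line-rdd")
      .getOrCreate()

    val lines: RDD[String] = spark.sparkContext.parallelize(Seq(
      "Let's have some fun.",
      "",
      "To have fun you don't need any plans."
    ))

    // The "map" over the RDD: one (line, counts) pair per input line.
    val perLine: RDD[(String, Map[String, Int])] =
      lines.map(line => (line, countWords(line)))

    perLine.collect().foreach(println)
    spark.stop()
  }
}
```

Punctuation is left attached to words here (`fun.` and `fun` count as different words), which mirrors splitting on whitespace only.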


Given the input:

$ cat input.txt
Let's have some fun.

To have fun you don't need any plans.

and assuming you want to use the Dataset API, you could do the following:

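// spark-shell imports these automatically; a standalone application needs them
// (plus `import spark.implicits._` for the $"..." column syntax).
import org.apache.spark.sql.functions._

// Read each line of the file into a single-column Dataset named "line".
val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
// Split every line on whitespace and emit one row per (line, word) pair.
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))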
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line                                 |words |
+-------------------------------------+------+
|Let's have some fun.                 |Let's |
|Let's have some fun.                 |have  |
|Let's have some fun.                 |some  |
|Let's have some fun.                 |fun.  |
|                                     |      |
|To have fun you don't need any plans.|To    |
|To have fun you don't need any plans.|have  |
|To have fun you don't need any plans.|fun   |
|To have fun you don't need any plans.|you   |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need  |
|To have fun you don't need any plans.|any   |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+

scala> wordsPerLine.
  groupBy("line", "words").
  count.
  withColumn("word_count", struct($"words", $"count")).
  select("line", "word_count").
  groupBy("line").
  agg(collect_set("word_count")).
  show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line                                 |collect_set(word_count)                                                       |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun.                 |[[have,1], [fun.,1], [Let's,1], [some,1]]                                     |
|                                     |[[,1]]                                                                        |
+-------------------------------------+------------------------------------------------------------------------------+

Done. Pretty simple, isn't it?

See the functions object (for the explode and struct functions).
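If a real map per line is preferred over a set of structs, one possible variant is sketched below. It assumes Spark 2.4 or later, where `map_from_entries` is available in the functions object. The object name `WordCountsAsMap`, the method `countsPerLine` and the output column `word_counts` are hypothetical. It also filters out the empty token that a blank line produces. As with the grouping above, identical lines are merged into a single row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

// Hypothetical names throughout; a sketch assuming Spark 2.4+.
object WordCountsAsMap {

  // Expects a DataFrame with a string column "line";
  // returns one row per distinct line with a map<word, count> column.
  def countsPerLine(lines: DataFrame): DataFrame =
    lines
      .withColumn("word", explode(split(col("line"), "\\s+")))
      .where(col("word") =!= "")          // drop empty tokens from blank lines
      .groupBy("line", "word").count()    // occurrences of each word within a line
      .groupBy("line")
      .agg(
        map_from_entries(collect_list(struct(col("word"), col("count"))))
          .as("word_counts"))

  def main(args: Array[String]): Unit = {
    // Local-mode session is an assumption; in spark-shell use the provided `spark`.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("word-counts-as-map")
      .getOrCreate()

    val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
    countsPerLine(lines).show(truncate = false)
    spark.stop()
  }
}
```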