Kir*_*hat 2 scala apache-spark
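Is there a way to use map and reduce to count word occurrences for each line of an RDD, rather than for the RDD as a whole?
For example, if an RDD[String] contains the following two lines:
Let's have some fun.
To have fun you don't need any plans.
then the output should be something like a map of key-value pairs:
("Let's",1)
("have",1)
("some",1)
("fun",1)
("To",1)
("have",1)
("fun",1)
("you",1)
("don't",1)
("need",1)
("plans",1)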
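If you are just starting with Spark and nobody has told you to use the RDD API, please don't. Spark SQL offers a much nicer and often more efficient API for this and many other distributed computations over large datasets in Spark.
Using the RDD API is like using assembler for a task you could do in Scala (or another higher-level programming language). It is too much to think about when you are starting your journey into Spark, so I would personally recommend the higher-level API of Spark SQL with DataFrames and Datasets first.
Given the input: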
$ cat input.txt
Let's have some fun.
To have fun you don't need any plans.
and assuming you want to use the Dataset API, you could do the following:
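// spark-shell already has these imports in scope; a standalone application needs them.
import org.apache.spark.sql.functions._
import spark.implicits._
// One row per line of the file, in a column named "line".
val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
// One row per whitespace-separated token, keeping the line it came from.
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))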
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line                                 |words |
+-------------------------------------+------+
|Let's have some fun.                 |Let's |
|Let's have some fun.                 |have  |
|Let's have some fun.                 |some  |
|Let's have some fun.                 |fun.  |
|                                     |      |
|To have fun you don't need any plans.|To    |
|To have fun you don't need any plans.|have  |
|To have fun you don't need any plans.|fun   |
|To have fun you don't need any plans.|you   |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need  |
|To have fun you don't need any plans.|any   |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+
scala> wordsPerLine.
groupBy("line", "words").
count.
withColumn("word_count", struct($"words", $"count")).
select("line", "word_count").
groupBy("line").
agg(collect_set("word_count")).
show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line                                 |collect_set(word_count)                                                       |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun.                 |[[have,1], [fun.,1], [Let's,1], [some,1]]                                     |
|                                     |[[,1]]                                                                        |
+-------------------------------------+------------------------------------------------------------------------------+
Done. Simple, isn't it?
See the functions object (for the explode and struct functions).
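For comparison, here is a minimal sketch of the per-line map-and-reduce logic the question asks about, written with plain Scala collections so it can be pasted into any Scala REPL without Spark. The names wordCounts, sampleLines and perLine are hypothetical and not part of any Spark API. Unlike the Dataset version above, it drops empty tokens, which is an assumption about the desired result.

```scala
// Hypothetical helper: counts the words of a single line.
def wordCounts(line: String): Map[String, Int] =
  line
    .split("\\s+")                 // same whitespace split as in the Dataset version
    .filter(_.nonEmpty)            // assumption: ignore empty tokens (e.g. from blank lines)
    .map(word => (word, 1))        // "map" step: pair every word with a count of 1
    .groupBy(_._1)                 // collect the pairs that belong to the same word
    .map { case (word, pairs) =>
      word -> pairs.map(_._2).reduce(_ + _)   // "reduce" step: sum the 1s per word
    }

// The two sample lines from the question.
val sampleLines = Seq(
  "Let's have some fun.",
  "To have fun you don't need any plans."
)

// One Map per line, rather than one Map for the whole input.
val perLine: Seq[Map[String, Int]] = sampleLines.map(wordCounts)

perLine.foreach(println)
```

Because wordCounts works on one line at a time, the same function could presumably be passed to map on an RDD[String] (for example rdd.map(wordCounts)) to get one map per line. Treat that as an untested suggestion rather than a recommendation over the Dataset approach.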