How do I write a word count using the Dataset API?

Question

sat*_*hya 3 java apache-spark apache-spark-sql

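I need to write word count logic using only Spark Datasets.

I have implemented this with Spark's JavaRDD class, but I would like to do the same thing with Spark SQL's Dataset<Row> class.

How do I do a word count in Spark SQL?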

Answer 1

Jac*_*ski 5

Here is one solution (most likely not the most efficient one).

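// Using the col function because the OP uses Java, not Scala... unfortunately
import org.apache.spark.sql.functions.{col, explode, length, split}
val q = spark.
  read.
  text("README.md").                                 // one row per line, in a column named "value"
  filter(length(col("value")) > 0).                  // skip empty lines
  withColumn("words", split(col("value"), "\\s+")).  // split each line on runs of whitespace
  select(explode(col("words")) as "word").           // one row per word
  groupBy("word").
  count.
  orderBy(col("count").desc)                         // most frequent words first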
scala> q.show
+---------+-----+
|     word|count|
+---------+-----+
|      the|   24|
|       to|   17|
|    Spark|   16|
|      for|   12|
|      and|    9|
|       ##|    9|
|         |    8|
|        a|    8|
|       on|    7|
|      can|    7|
|      run|    7|
|       in|    6|
|       is|    6|
|       of|    5|
|    using|    5|
|      you|    4|
|       an|    4|
|    build|    4|
|including|    4|
|     with|    4|
+---------+-----+
only showing top 20 rows
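The blank entry in the table (count 8) is probably an empty token rather than a real word. Assuming Spark's split follows Java regular-expression semantics and keeps empty strings (which I believe is its default), a line that starts with whitespace yields a leading empty token. The plain-Java sketch below involves no Spark at all; the names SplitTokens, tokens and countWords are hypothetical, and it only mimics that behaviour on a small in-memory list:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class SplitTokens {

    // Assumption: mirrors split(col("value"), "\\s+") by using limit -1,
    // which keeps empty tokens at the start and at the end of the line.
    static List<String> tokens(String line) {
        return Arrays.asList(line.split("\\s+", -1));
    }

    // Counts tokens over all non-empty lines; optionally drops empty tokens.
    static Map<String, Long> countWords(List<String> lines, boolean dropEmpty) {
        return lines.stream()
            .filter(line -> !line.isEmpty())                 // same idea as length(value) > 0
            .flatMap(line -> tokens(line).stream())          // same idea as explode
            .filter(word -> !dropEmpty || !word.isEmpty())   // optional clean-up step
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("  indented line", "plain line", "");
        System.out.println(countWords(lines, false));  // includes an empty-string key
        System.out.println(countWords(lines, true));   // empty tokens removed
    }
}
```

In the Dataset pipeline itself, the equivalent would presumably be one more filter on the word column after explode, for example keeping only rows where length(col("word")) > 0.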

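Since the question is tagged java, here is a sketch of how the same pipeline might look with the Java API. The class name DatasetWordCount, the helper wordCounts, the local[*] master and the default README.md path are illustrative assumptions, and the code needs spark-sql on the classpath:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.length;
import static org.apache.spark.sql.functions.split;

public class DatasetWordCount {

    // Builds the word-count query over a Dataset that has a single
    // string column named "value" (the layout produced by read().text()).
    static Dataset<Row> wordCounts(Dataset<Row> lines) {
        return lines
            .filter(length(col("value")).gt(0))                 // skip empty lines
            .withColumn("words", split(col("value"), "\\s+"))   // split on runs of whitespace
            .select(explode(col("words")).as("word"))           // one row per word
            .groupBy("word")
            .count()
            .orderBy(col("count").desc());                      // most frequent words first
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("DatasetWordCount")
            .master("local[*]")   // assumption: local run, drop this when using spark-submit
            .getOrCreate();

        // Assumption: the input path comes from the first argument, else README.md.
        String path = args.length > 0 ? args[0] : "README.md";
        Dataset<Row> lines = spark.read().text(path);

        wordCounts(lines).show();
        spark.stop();
    }
}
```

Keeping the query in a separate method that takes a Dataset<Row> makes it easy to try on a small in-memory Dataset before pointing it at a real file.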

Asked:	8 years, 7 months ago
Viewed:	1379 times
Active:	7 years, 1 month ago