How do I compute the standard deviation and mean of an RDD[Long]?

Markus (scala, apache-spark, apache-spark-sql)

I have an RDD[Long] called mod, and I would like to compute the standard deviation and mean of this RDD using Spark 2.2 and Scala 2.11.8.

How can I do this?

I tried computing the mean as follows, but is there a simpler way to get both values?

import org.apache.spark.sql.functions.{avg, stddev}
import spark.implicits._

val avg_val = mod.toDF("col").agg(
    avg($"col").as("avg")
).first().getDouble(0)

val stddev_val = mod.toDF("col").agg(
    stddev($"col").as("stddev")
).first().getDouble(0)

hi-zir

I have an RDD[Long] called mod, and I want to compute the standard deviation and mean

Just use stats:

scala> val mod = sc.parallelize(Seq(1L, 3L, 5L))
mod: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> val stats = mod.stats
stats: org.apache.spark.util.StatCounter = (count: 3, mean: 3.000000, stdev: 1.632993, max: 5.000000, min: 1.000000)

scala> stats.mean
res0: Double = 3.0

scala> stats.stdev
res1: Double = 1.632993161855452
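
The StatCounter returned by stats also exposes a few more aggregates (count, sum, variance, and the sample-based variants); for example, continuing the same session:

scala> stats.variance
res2: Double = 2.6666666666666665

scala> stats.sampleStdev
res3: Double = 2.0

Worth noting: stdev and variance on StatCounter are the population values, while the SQL stddev function in the question computes the sample standard deviation, so the two can disagree (about 1.633 vs. 2.0 for this small RDD).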

It uses the same internals as stdev and mean, but only has to scan the data once.
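
For comparison, a minimal sketch of the two-pass alternative that stats avoids, using the mean and stdev methods an RDD[Long] gets through the implicit DoubleRDDFunctions conversion:

// Each of these calls runs its own job over the RDD:
val avg_val    = mod.mean()    // first scan
val stddev_val = mod.stdev()   // second scan

// mod.stats() computes count, mean, stdev, max and min in a single scan,
// so both values above can be read from one StatCounter instead.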

With a Dataset I would recommend:

import org.apache.spark.sql.functions.{mean, stddev}

val (avg_val, stddev_val) = mod.toDS
  .agg(mean("value"), stddev("value"))
  .as[(Double, Double)].first

Or:

import org.apache.spark.sql.Row

val Row(avg_val: Double, stddev_val: Double) = mod.toDS
  .agg(mean("value"), stddev("value"))
  .first

but it is neither necessary nor particularly useful here.