Difference between describe() and summary() in Apache Spark

Tok*_*kyo 1 apache-spark

What is the difference between summary() and describe()? They seem to serve the same purpose, and I have not managed to find any difference (if there is one).

Shu*_*Shu 6

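The two functions differ once you pass arguments to them:

.describe() takes cols: String* (columns of the df) as optional arguments.

.summary() takes statistics: String* (count, mean, stddev, etc.) as optional arguments.

Called without arguments, summary() also reports the 25%, 50% and 75% percentiles, which describe() does not.

Example: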

scala> val df_des=Seq((1,"a"),(2,"b"),(3,"c")).toDF("id","name")
scala> df_des.describe().show(false) //without args
//Result:
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count  |3  |3   |
//|mean   |2.0|null|
//|stddev |1.0|null|
//|min    |1  |a   |
//|max    |3  |c   |
//+-------+---+----+
scala> df_des.summary().show(false) //without args
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count  |3  |3   |
//|mean   |2.0|null|
//|stddev |1.0|null|
//|min    |1  |a   |
//|25%    |1  |null|
//|50%    |2  |null|
//|75%    |3  |null|
//|max    |3  |c   |
//+-------+---+----+
scala> df_des.describe("id").show(false) //describe on id column only
//+-------+---+
//|summary|id |
//+-------+---+
//|count  |3  |
//|mean   |2.0|
//|stddev |1.0|
//|min    |1  |
//|max    |3  |
//+-------+---+
scala> df_des.summary("count").show(false) //get count summary only
//+-------+---+----+
//|summary|id |name|
//+-------+---+----+
//|count  |3  |3   |
//+-------+---+----+
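Since summary() has no column parameter and describe() has no statistics parameter, restricting both at once means combining summary() with a select(). The following is a minimal sketch, not part of the original answer: the names spark, df, idStats and idDescribe are illustrative, and a local master is assumed so it can run outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode for illustration. In spark-shell a session
// already exists and getOrCreate() simply reuses it.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("describe-vs-summary")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")

// summary() only accepts statistic names, so select the columns first.
// Percentiles are passed as strings such as "25%".
val idStats = df.select("id").summary("count", "min", "25%", "75%", "max")
idStats.show(false)

// describe() only accepts column names; the set of statistics is fixed
// (count, mean, stddev, min, max).
val idDescribe = df.describe("id")
idDescribe.show(false)
```

Both results are ordinary DataFrames with a leading summary column, so they can be filtered or joined like any other DataFrame.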