I have a Spark df
spark_df = spark.createDataFrame(
    [(1, 7, 'foo'),
     (2, 6, 'bar'),
     (3, 4, 'foo'),
     (4, 8, 'bar'),
     (5, 1, 'bar')],
    ['v1', 'v2', 'id']
)
Expected output
   id   avg(v1)   avg(v2)  min(v1)  min(v2)  0.25(v1)    0.25(v2)    0.5(v1)     0.5(v2)
0  bar  3.666667  5.0      2        1        some-value  some-value  some-value  some-value
1  foo  2.000000  5.5      1        4        some-value  some-value  some-value  some-value
So far I have been able to get basic statistics such as mean, min and max, but I cannot get the quantiles. I know this is easy to do in Pandas, but I have not managed to do it in PySpark.
I am also aware of approxQuantile, but I cannot combine the basic statistics with quantiles in PySpark (a short note on this follows the snippet below).
So far I can get basic statistics such as mean and min using agg. I would like the quantiles in the same df as well.
from pyspark.sql import functions as F

func = [F.mean, F.min]
NUMERICAL_FEATURE_LIST = ['v1', 'v2']
GROUP_BY_FIELDS = ['id']
exp = [f(F.col(c)) for f in func for c in NUMERICAL_FEATURE_LIST]
df_fin = spark_df.groupby(*GROUP_BY_FIELDS).agg(*exp)
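For context, approxQuantile is a method on the DataFrame itself rather than an aggregate expression, so it cannot be mixed into the agg list above. A minimal illustration, assuming Spark 2.2+ where it accepts a list of columns (the 0.01 relative error is an arbitrary choice):

# approxQuantile runs over the whole DataFrame (no groupBy) and returns
# plain Python floats, not Column expressions usable inside agg()
quantiles = spark_df.approxQuantile(['v1', 'v2'], [0.25, 0.5], 0.01)
# e.g. [[q25_v1, q50_v1], [q25_v2, q50_v2]]
print(quantiles)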
Maybe this will help:
val spark_df = Seq((1, 7, "foo"),
  (2, 6, "bar"),
  (3, 4, "foo"),
  (4, 8, "bar"),
  (5, 1, "bar")
).toDF("v1", "v2", "id")
spark_df.show(false)
spark_df.printSchema()
spark_df.summary() // default = "count", "mean", "stddev", "min", "25%", "50%", "75%", "max"
  .show(false)
/**
* +---+---+---+
* |v1 |v2 |id |
* +---+---+---+
* |1 |7 |foo|
* |2 |6 |bar|
* |3 |4 |foo|
* |4 |8 |bar|
* |5 |1 |bar|
* +---+---+---+
*
* root
* |-- v1: integer (nullable = false)
* |-- v2: integer (nullable = false)
* |-- id: string (nullable = true)
*
* +-------+------------------+------------------+----+
* |summary|v1 |v2 |id |
* +-------+------------------+------------------+----+
* |count |5 |5 |5 |
* |mean |3.0 |5.2 |null|
* |stddev |1.5811388300841898|2.7748873851023217|null|
* |min |1 |1 |bar |
* |25% |2 |4 |null|
* |50% |3 |6 |null|
* |75% |4 |7 |null|
* |max |5 |8 |foo |
* +-------+------------------+------------------+----+
*/
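Note that summary() computes these statistics over the whole DataFrame rather than per group. If the statistics are needed per id, as in the expected output, one possible approach is to add approximate-percentile expressions to the same agg call from the question. This is a sketch, assuming a Spark version where percentile_approx is available as a SQL function (it is a built-in in Spark 2.x and later):

from pyspark.sql import functions as F

NUMERICAL_FEATURE_LIST = ['v1', 'v2']
GROUP_BY_FIELDS = ['id']
QUANTILES = [0.25, 0.5]

# basic statistics, exactly as in the question
exp = [f(F.col(c)) for f in [F.mean, F.min] for c in NUMERICAL_FEATURE_LIST]

# approximate quantiles via the percentile_approx SQL function,
# aliased to match the expected output column names
exp += [
    F.expr(f"percentile_approx({c}, {q})").alias(f"{q}({c})")
    for q in QUANTILES
    for c in NUMERICAL_FEATURE_LIST
]

df_fin = spark_df.groupby(*GROUP_BY_FIELDS).agg(*exp)
df_fin.show()

On Spark 3.1+ the same expressions can be written with F.percentile_approx(c, q) instead of F.expr.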
If you need that exact output format, use the answer below.