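When using PySpark, I would like to compute the difference between each value in a group and that group's median. Is this possible? Below is some code I hacked together that does what I want, except that it computes the per-group difference from the mean rather than the median. Also, feel free to comment on how I could do this better if you think it would help :)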
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StringType,
    LongType,
    DoubleType,
    StructField,
    StructType,
)
from pyspark.sql import functions as F
sc = SparkContext(appName='myapp')
spark = SparkSession(sc)
file_name = 'data.csv'
fields = [
    StructField('group2', LongType(), True),
    StructField('name', StringType(), True),
    StructField('value', DoubleType(), True),
    StructField('group1', LongType(), True),
]
schema = StructType(fields)
df = spark.read.csv(
    file_name, header=False, mode="DROPMALFORMED", schema=schema
)
df.show()
means = df.select(
    ['group1', 'group2', 'name', 'value']
).groupBy(
    ['group1', 'group2']
).agg(
    F.mean('value').alias('mean_value')
).orderBy('group1', 'group2')
cond …
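
For what it's worth, here is a rough sketch of how the mean could be swapped for a median. It is one possible approach under a few assumptions, not a definitive answer. The function names `with_median_diff` and `median_diff_reference` and the output columns `median_value` and `diff_from_median` are made up for illustration. The Spark part relies on the SQL aggregate `percentile_approx`, which returns an approximate median that is always one of the values in the data, so for groups with an even number of rows it can differ from the textbook median. The plain-Python helper computes the exact median and is only meant for checking a small sample by hand; it assumes rows are given as `(group1, group2, name, value)` tuples.

```python
from collections import defaultdict
from statistics import median


def with_median_diff(df, group_cols=("group1", "group2"), value_col="value"):
    """Spark sketch: add each group's approximate median and each row's deviation from it."""
    # Imported here so the plain-Python helper below can run without Spark installed.
    from pyspark.sql import functions as F

    # percentile_approx is a Spark SQL aggregate; 0.5 asks for the (approximate) median.
    medians = df.groupBy(*group_cols).agg(
        F.expr(f"percentile_approx({value_col}, 0.5)").alias("median_value")
    )
    # Join the per-group median back onto every row, then subtract.
    # Rows whose group keys are null will not match and get a null median.
    return df.join(medians, on=list(group_cols), how="left").withColumn(
        "diff_from_median", F.col(value_col) - F.col("median_value")
    )


def median_diff_reference(rows):
    """Plain-Python check: exact deviation from the group median.

    rows is an iterable of (group1, group2, name, value) tuples (assumed layout).
    """
    rows = list(rows)
    groups = defaultdict(list)
    for g1, g2, _name, value in rows:
        groups[(g1, g2)].append(value)
    medians = {key: median(values) for key, values in groups.items()}
    return [
        (g1, g2, name, value - medians[(g1, g2)])
        for g1, g2, name, value in rows
    ]
```

With the `df` from the question, `with_median_diff(df).show()` would be the intended usage. Joining the aggregate back keeps every original row, which is what a per-row difference needs; a plain `groupBy(...).agg(...)` on its own would collapse each group to a single row.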