Posts by Nik*_*wal

Median and quantile values in PySpark

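My dataframe has an age column, and the total row count is about 77 billion. I want to compute quantile values for that column using PySpark. I have some code, but the computation takes a very long time (maybe my approach is a poor one).

Is there a good way to improve this?

Sample dataframe: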

id       age
1         18
2         32
3         54
4         63
5         42
6         23


What I have done so far:

#Summary stats
df.describe('age').show()

#For Quantile values
x5 = df.approxQuantile("age", [0.5], 0)
x25 = df.approxQuantile("age", [0.25], 0)
x75 = df.approxQuantile("age", [0.75], 0)


python apache-spark apache-spark-sql pyspark

Nik*_*wal

2022-09-15

3
votes

1
answer

10k
views

Implementing multiple if-else conditions in a pandas DataFrame

My dataframe is -

id       score
1          50
2          88
3          44
4          77
5          93


I want my dataframe to look like -

id       score      is_good
1          50        low
2          88        high
3          44        low
4          77        medium
5          93        high


I have written the following code -

def selector(row):
    if row['score'] >= 0 and row['score'] <= 50 :
        return "low"
    elif row['score'] > 50 and row['score'] <=80 :
        return "medium"
    else:
        return "high"

x['is_good'] = x.apply(lambda row : selector(x), axis=1)


I think the logic is fine, but the code does not work. Maybe we can use the map function.
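A likely cause of the failure is that the lambda passes the whole DataFrame `x` to `selector` instead of the current `row`. `row['score'] >= 0` then evaluates to a Series, and using it in an `if` raises a `ValueError` about an ambiguous truth value. The sketch below shows two possible fixes under that reading: mapping a scalar version of `selector` over the `score` column, as the post suggests, and a vectorized alternative with `pd.cut`. The column name `is_good_cut` and the bin edges are illustrative assumptions.

```python
import pandas as pd


def selector(score: float) -> str:
    """Classify one score with the same thresholds as the original function."""
    if 0 <= score <= 50:
        return "low"
    elif 50 < score <= 80:
        return "medium"
    else:
        return "high"


x = pd.DataFrame({"id": [1, 2, 3, 4, 5], "score": [50, 88, 44, 77, 93]})

# Option 1: apply the scalar function to each value of the column.
x["is_good"] = x["score"].map(selector)

# Option 2 (assumed alternative): bin the whole column at once.
# Intervals are [0, 50], (50, 80], (80, inf). Unlike selector, negative
# scores fall outside every bin and become NaN instead of "high".
x["is_good_cut"] = pd.cut(
    x["score"],
    bins=[0, 50, 80, float("inf")],
    labels=["low", "medium", "high"],
    include_lowest=True,
)

print(x)
```

Keeping `apply` also works if the call is changed to `x.apply(lambda row: selector(row['score']), axis=1)`, though mapping over the single column avoids building a row object for every record.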

python pandas pandas-groupby

Nik*_*wal

lucky-day

2
votes

1
answer

69
views