How to create a new column in PySpark from a calculation on other columns

Question

iva*_*lan 1 python apache-spark apache-spark-sql pyspark

I have the following DataFrame:

+-----------+----------+----------+
|   some_id | one_col  | other_col|
+-----------+----------+----------+
|       xx1 |        11|       177|         
|       xx2 |      1613|      2000|    
|       xx4 |         0|     12473|      
+-----------+----------+----------+


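I need to add a new column based on a calculation over the first and second value columns. For example, for col1_value=1 and col2_value=10 it should hold the percentage of col2 that col1 makes up, so col3_value = (1/10)*100 = 10%: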

+-----------+----------+----------+--------------+
|   some_id | one_col  | other_col|  percentage  |
+-----------+----------+----------+--------------+
|       xx1 |        11|       177|     6.2      |  
|       xx3 |         1|       10 |      10      |     
|       xx2 |      1613|      2000|     80.6     |
|       xx4 |         0|     12473|      0       |
+-----------+----------+----------+--------------+


I know I need to use a udf for this, but how do I add the new column values directly from its result?

Some pseudocode:

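import pyspark
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # needed for the udf return type

df = load_my_df  # placeholder for the DataFrame shown above

def my_udf(val1, val2):
    # share of val1 in val2, expressed as a percentage
    return (val1 / val2) * 100

udf_percentage = udf(my_udf, FloatType())

# how do I pass one_col and other_col to the udf here?
df = df.withColumn('percentage', udf_percentage(...))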


Thank you!

Answer 1

小智 5

df.withColumn('percentage', udf_percentage("one_col", "other_col"))


or

df.withColumn('percentage', udf_percentage(df["one_col"], df["other_col"]))


or

df.withColumn('percentage', udf_percentage(df.one_col, df.other_col))


or

from pyspark.sql.functions import col

df.withColumn('percentage', udf_percentage(col("one_col"), col("other_col")))


But why not simply:

df.withColumn('percentage', col("one_col") / col("other_col") * 100)


Archived:	7 years, 7 months ago
Views:	9633
Last recorded:	7 years, 7 months ago