In SQL, I can easily update column values with UPDATE. For example, I have a table (Student) like:
student_id, grade, new_student_id
123 B 234
555 A null
UPDATE Student
SET student_id = new_student_id
WHERE new_student_id IS NOT NULL
How can I do this in Spark using Spark SQL (PySpark)?
You can use withColumn to overwrite the existing new_student_id column: keep the original new_student_id value when it is not null, otherwise use the value from the student_id column:
from pyspark.sql.functions import col, when

# Create sample data
students = sc.parallelize([(123, 'B', 234), (555, 'A', None)]) \
    .toDF(['student_id', 'grade', 'new_student_id'])

# Use withColumn to fall back to student_id when new_student_id is not populated
cleaned = students.withColumn(
    "new_student_id",
    when(col("new_student_id").isNull(), col("student_id"))
    .otherwise(col("new_student_id")))

cleaned.show()
With the sample data as input:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
| 123| B| 234|
| 555| A| null|
+----------+-----+--------------+
The output data looks like this:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
| 123| B| 234|
| 555| A| 555|
+----------+-----+--------------+
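Since the question asks about Spark SQL specifically, here is a minimal sketch of two equivalent alternatives, assuming a SparkSession named spark is available: coalesce, which returns the first non-null value among its arguments, and a plain SQL query against a temporary view.

from pyspark.sql.functions import coalesce, col

# Assumes an existing SparkSession named `spark`
students = spark.createDataFrame(
    [(123, 'B', 234), (555, 'A', None)],
    ['student_id', 'grade', 'new_student_id'])

# coalesce picks the first non-null value, so new_student_id
# falls back to student_id when it is null
cleaned = students.withColumn(
    "new_student_id",
    coalesce(col("new_student_id"), col("student_id")))

# The same logic expressed as Spark SQL against a temporary view
students.createOrReplaceTempView("Student")
cleaned_sql = spark.sql("""
    SELECT student_id,
           grade,
           coalesce(new_student_id, student_id) AS new_student_id
    FROM Student
""")

Note that Spark DataFrames are immutable, so there is no in-place UPDATE; both variants produce a new DataFrame with the corrected column.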