How to change a column value in Spark SQL

rai*_*all 5 sql apache-spark apache-spark-sql pyspark

In SQL I can easily update a column value with UPDATE. For example, given a table (Student) like:

student_id   grade   new_student_id
123          B       234
555          A       null

UPDATE Student
SET student_id = new_student_id
WHERE new_student_id IS NOT NULL

How do I do the same thing in Spark with Spark SQL (PySpark)?

Ale*_*lex 5

You can use withColumn to overwrite the existing new_student_id column: keep the original new_student_id value when it is not null, otherwise fall back to the value from the student_id column:

from pyspark.sql.functions import col, when

# Create sample data
students = sc.parallelize([(123, 'B', 234), (555, 'A', None)]) \
             .toDF(['student_id', 'grade', 'new_student_id'])

# Use withColumn to substitute student_id when new_student_id is not populated
cleaned = students.withColumn("new_student_id",
          when(col("new_student_id").isNull(), col("student_id")).
          otherwise(col("new_student_id")))
cleaned.show()

With the sample data as input:

+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
|       123|    B|           234|
|       555|    A|          null|
+----------+-----+--------------+

The output looks like this:

+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
|       123|    B|           234|
|       555|    A|           555|
+----------+-----+--------------+