Tok*_*kyo 4 python dataframe apache-spark apache-spark-sql pyspark
I have two DataFrames, df1 and df2:
df1.show()
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 0 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
df2.show()
+---+--------+-----+----+--------+
|cA | cB | cH | cI | cJ |
+---+--------+-----+----+--------+
| A| abc | a | b | ? |
| C| ghi | a | c | ? |
+---+--------+-----+----+--------+
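I want to update the cE column in df1 and set it to 1 if the row is referenced in df2. Each record is identified by the cA and cB columns.
Below is the desired output; note that the cE value of the first record has been updated to 1: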
+---+--------+-----+----+--------+
|cA | cB | cC | cD | cE |
+---+--------+-----+----+--------+
| A| abc | 0.1 | 0.0| 1 |
| B| def | 0.15| 0.5| 0 |
| C| ghi | 0.2 | 0.2| 1 |
| D| jkl | 1.1 | 0.1| 0 |
| E| mno | 0.1 | 0.1| 0 |
+---+--------+-----+----+--------+
When a column's value needs to be updated based on another column, the when clause comes in handy. See the when and otherwise clauses.
import pyspark.sql.functions as F
# A left join keeps exactly the rows of df1; matched rows get cE = 1, others keep their cE.
cond = (df1.cA == df2.cA) & (df1.cB == df2.cB)
df3 = (df1.join(df2, cond, "left")
          .withColumn("cE", F.when(cond, 1).otherwise(df1.cE))
          .select(df1.cA, df1.cB, df1.cC, df1.cD, "cE"))
df3.show()
+---+---+----+---+---+
| cA| cB| cC| cD| cE|
+---+---+----+---+---+
| E|mno| 0.1|0.1| 0|
| B|def|0.15|0.5| 0|
| C|ghi| 0.2|0.2| 1|
| A|abc| 0.1|0.0| 1|
| D|jkl| 1.1|0.1| 0|
+---+---+----+---+---+
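To check the expected result on a small sample without a Spark session, the rule can be modeled in plain Python. This is only a sketch of the logic (set cE to 1 when the (cA, cB) key appears in df2, otherwise keep the existing value), not Spark code. The names `df1_rows`, `df2_rows`, and `flag_referenced` are hypothetical, and the rows are copied from the tables in the question.

```python
# Plain-Python model of the "flag rows referenced in df2" rule (not Spark code).
# Rows are dicts; the sample values are copied from the df1 and df2 tables.
df1_rows = [
    {"cA": "A", "cB": "abc", "cC": 0.1, "cD": 0.0, "cE": 0},
    {"cA": "B", "cB": "def", "cC": 0.15, "cD": 0.5, "cE": 0},
    {"cA": "C", "cB": "ghi", "cC": 0.2, "cD": 0.2, "cE": 1},
    {"cA": "D", "cB": "jkl", "cC": 1.1, "cD": 0.1, "cE": 0},
    {"cA": "E", "cB": "mno", "cC": 0.1, "cD": 0.1, "cE": 0},
]
df2_rows = [
    {"cA": "A", "cB": "abc", "cH": "a", "cI": "b", "cJ": "?"},
    {"cA": "C", "cB": "ghi", "cH": "a", "cI": "c", "cJ": "?"},
]


def flag_referenced(left, right, keys=("cA", "cB"), target="cE"):
    """Return copies of the `left` rows with `target` set to 1 where the key occurs in `right`."""
    # Collect the composite keys present on the right side.
    referenced = {tuple(row[k] for k in keys) for row in right}
    result = []
    for row in left:
        updated = dict(row)  # copy, so the input rows are not mutated
        if tuple(row[k] for k in keys) in referenced:
            updated[target] = 1
        # Unreferenced rows keep their original value of `target`.
        result.append(updated)
    return result


updated_rows = flag_referenced(df1_rows, df2_rows)
print([(r["cA"], r["cE"]) for r in updated_rows])
# [('A', 1), ('B', 0), ('C', 1), ('D', 0), ('E', 0)]
```

Because the keys are collected into a set, a key that occurs several times in df2 still produces a single output row. A Spark join behaves differently: duplicate (cA, cB) pairs in df2 would duplicate the matching df1 rows, so it may be worth joining against `df2.select("cA", "cB").distinct()` if df2 is not guaranteed to be unique on those columns.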