from pyspark.sql import Row, functions as F

# sample data with duplicate rows, nulls, and a malformed date
row = Row("UK_1", "UK_2", "Date", "Cat")
df = sc.parallelize([
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(3, 3, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A'),
    row(1, 1, '12/10/2016', 'A'),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(None, None, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A')
]).toDF()

pks = ["UK_1", "UK_2"]
columns = df.columns
df1 = (
    df
    .select(columns)
    # .withColumn('pk', F.concat(pks))  # fails: concat does not accept a list
    .withColumn('pk', F.concat("UK_1", "UK_2"))
)
df1.show()
Is there any way to pass a list of columns to concat? I want to use this code in scenarios where the columns can vary, so I would like to pass them in as a list.
Yes, the syntax is Python's *args (a variable number of arguments):
df.withColumn("pk", F.concat(*pks)).show()
+----+----+------------+---+----+
|UK_1|UK_2| Date|Cat| pk|
+----+----+------------+---+----+
| 1| 1| 12/10/2016| A| 11|
| 1| 2| null| A| 12|
| 2| 1| 14/10/2016| B| 21|
| 3| 3|!~2016/2/276| B| 33|
|null| 1| 26/09/2016| A|null|
| 1| 1| 12/10/2016| A| 11|
| 1| 2| null| A| 12|
| 2| 1| 14/10/2016| B| 21|
|null|null|!~2016/2/276| B|null|
|null| 1| 26/09/2016| A|null|
+----+----+------------+---+----+
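
Note that concat returns null whenever any of its input columns is null, as in the rows above. If you would rather skip nulls (and optionally put a separator between the key parts), concat_ws unpacks a list the same way; a minimal sketch of that variant, using the same pks list:

# concat_ws takes the separator first, then the columns, and ignores null inputs
df.withColumn("pk", F.concat_ws("_", *pks)).show()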