Vis*_*App 4 python dataframe pyspark
我有一个数据框,其中包含类似于以下列的列表.所有列中列表的长度不相同.
Name Age Subjects Grades
[Bob] [16] [Maths,Physics,Chemistry] [A,B,C]
Run Code Online (Sandbox Code Playgroud)
我希望以这样的方式分解数据帧,以便获得以下输出 -
Name Age Subjects Grades
Bob 16 Maths A
Bob 16 Physics B
Bob 16 Chemistry C
Run Code Online (Sandbox Code Playgroud)
我怎样才能做到这一点?
abe*_*bop 31
PySparkarrays_zip在 2.4 中添加了一个函数,从而无需 Python UDF 来压缩数组。
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df = df.withColumn("new", F.arrays_zip("Subjects", "Grades"))\
.withColumn("new", F.explode("new"))\
.select("Name", "Age", F.col("new.Subjects").alias("Subjects"), F.col("new.Grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Run Code Online (Sandbox Code Playgroud)
Dav*_*itz 13
参加聚会迟到了:-)
最简单的方法是使用inline没有 python API 但受selectExpr.
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
Run Code Online (Sandbox Code Playgroud)
df.selectExpr('Name[0] as Name','Age[0] as Age','inline(arrays_zip(Subjects,Grades))').show()
Run Code Online (Sandbox Code Playgroud)
这有效,
import pyspark.sql.functions as F
from pyspark.sql.types import *
df = sql.createDataFrame(
[(['Bob'], [16], ['Maths','Physics','Chemistry'], ['A','B','C'])],
['Name','Age','Subjects', 'Grades'])
df.show()
+-----+----+--------------------+---------+
| Name| Age| Subjects| Grades|
+-----+----+--------------------+---------+
|[Bob]|[16]|[Maths, Physics, ...|[A, B, C]|
+-----+----+--------------------+---------+
Run Code Online (Sandbox Code Playgroud)
使用udf带zip.这些列需要explode在爆炸之前合并.
combine = F.udf(lambda x, y: list(zip(x, y)),
ArrayType(StructType([StructField("subs", StringType()),
StructField("grades", StringType())])))
df = df.withColumn("new", combine("Subjects", "Grades"))\
.withColumn("new", F.explode("new"))\
.select("Name", "Age", F.col("new.subs").alias("Subjects"), F.col("new.grades").alias("Grades"))
df.show()
+-----+----+---------+------+
| Name| Age| Subjects|Grades|
+-----+----+---------+------+
|[Bob]|[16]| Maths| A|
|[Bob]|[16]| Physics| B|
|[Bob]|[16]|Chemistry| C|
+-----+----+---------+------+
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
6123 次 |
| 最近记录: |