Shu*_*wal 11 python dataframe pyspark
I need to merge multiple columns of a dataframe into a single column whose values are a list (or tuple) of the original column values, using pyspark in Python.
Input dataframe:
+-------+-------+-------+-------+-------+
| name  | mark1 | mark2 | mark3 | Grade |
+-------+-------+-------+-------+-------+
| Jim   | 20    | 30    | 40    | "C"   |
| Bill  | 30    | 35    | 45    | "A"   |
| Kim   | 25    | 36    | 42    | "B"   |
+-------+-------+-------+-------+-------+
Output dataframe should be:
+-------+-----------------+
| name  | marks           |
+-------+-----------------+
| Jim   | [20,30,40,"C"]  |
| Bill  | [30,35,45,"A"]  |
| Kim   | [25,36,42,"B"]  |
+-------+-----------------+
Mic*_*nko 13
The columns can be merged with Spark's array function:
import pyspark.sql.functions as f
columns = [f.col("mark1"), ...]
output = input.withColumn("marks", f.array(columns)).select("name", "marks")
You may need to cast the entries to a common type for the merge to succeed.

Have a look at this documentation: https://spark.apache.org/docs/2.1.0/ml-features.html#vectorassembler
from pyspark.ml.feature import VectorAssembler

# Assemble the numeric mark columns into a single vector column.
assembler = VectorAssembler(
    inputCols=["mark1", "mark2", "mark3"],
    outputCol="marks")
output = assembler.transform(dataset)
output.select("name", "marks").show(truncate=False)