Convert row values into a column array in a Spark DataFrame

Question

pra*_*ads 1 scala apache-spark spark-dataframe

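I'm working with Spark DataFrames. I need to group by a column and collect the values of another column from the grouped rows into an array, stored as a new column. Example: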

Input:

employee | Address
------------------
Micheal  |  NY
Micheal  |  NJ

Output:

employee | Address
------------------
Micheal  | (NY,NJ)

Any help is much appreciated.

Answer 1

Vis*_*667 5

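Here is an alternative solution: I convert the DataFrame to an RDD, do the grouping there, and then convert the result back to a DataFrame with sqlContext.createDataFrame().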

sample.json

{"employee":"Michale","Address":"NY"}
{"employee":"Michale","Address":"NJ"}
{"employee":"Sam","Address":"NY"}
{"employee":"Max","Address":"NJ"}

Spark application

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

val df = sqlContext.read.json("sample.json")

// Print the original DataFrame
df.show()

// Define the schema for the aggregated DataFrame
val dataSchema = new StructType(
  Array(
    StructField("employee", StringType, nullable = true),
    StructField("Address", ArrayType(StringType, containsNull = true), nullable = true)
  )
)

// Convert the DataFrame to an RDD and group the rows by employee
val aggregatedRdd: RDD[Row] = df.rdd
  .groupBy(r => r.getAs[String]("employee"))
  .map { case (employee, rows) =>
    // Build one Row per employee holding all of that employee's addresses
    Row(employee, rows.map(_.getAs[String]("Address")).toArray)
  }

// Create a DataFrame from aggregatedRdd using the schema defined above
val aggregatedDf = sqlContext.createDataFrame(aggregatedRdd, dataSchema)

// Print the aggregated DataFrame
aggregatedDf.show()

Output:

+-------+--------+---+
|Address|employee|num|
+-------+--------+---+
|     NY| Michale|  1|
|     NJ| Michale|  2|
|     NY|     Sam|  3|
|     NJ|     Max|  4|
+-------+--------+---+

+--------+--------+
|employee| Address|
+--------+--------+
|     Sam|    [NY]|
| Michale|[NY, NJ]|
|     Max|    [NJ]|
+--------+--------+
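The group-then-map step in the RDD code above has the same shape as groupBy on an ordinary Scala collection. The following is only a minimal sketch on in-memory data, not part of the original answer; the tuples mirror the sample JSON records and the name addressesByEmployee is made up for illustration. On a real RDD the same step shuffles every value for a key across the cluster.

```scala
// Plain Scala collections, no Spark needed; the tuples mirror the sample records.
val rows = Seq(
  ("Michale", "NY"),
  ("Michale", "NJ"),
  ("Sam", "NY"),
  ("Max", "NJ")
)

// groupBy yields a Map from employee to that employee's (employee, address) pairs;
// the map step then keeps only the addresses for each key.
// addressesByEmployee is a hypothetical name used only in this sketch.
val addressesByEmployee: Map[String, Seq[String]] =
  rows.groupBy(_._1).map { case (employee, group) =>
    employee -> group.map(_._2)
  }

// Print one line per employee (the order of the keys is not guaranteed).
addressesByEmployee.foreach { case (employee, addresses) =>
  println(s"$employee -> ${addresses.mkString("[", ", ", "]")}")
}
```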

Answer 2

小智 5

If you are using Spark 2.0+, you can use collect_list or collect_set. Your query would look something like this (assuming your DataFrame is called input):

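import org.apache.spark.sql.functions._

// Column names are passed as strings, so no implicit Symbol-to-Column conversion is needed
input.groupBy("employee").agg(collect_list("Address").as("Address"))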

Use collect_list if duplicates are acceptable. If you don't want duplicates and only need the unique items in the list, use collect_set.
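To make the difference concrete, here is a small sketch that runs both aggregations side by side. It assumes Spark 2.0+ on the classpath and a local session. The duplicate (Michale, NY) row and the column names all_addresses and distinct_addresses are invented for this illustration and are not in the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("collect-example")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data with one duplicated (Michale, NY) row.
val input = Seq(
  ("Michale", "NY"),
  ("Michale", "NJ"),
  ("Michale", "NY"),
  ("Sam", "NY")
).toDF("employee", "Address")

val grouped = input
  .groupBy("employee")
  .agg(
    collect_list("Address").as("all_addresses"),     // keeps duplicates
    collect_set("Address").as("distinct_addresses")  // drops duplicates
  )

// Element order inside each array is not guaranteed.
grouped.show(truncate = false)
```

For Michale, all_addresses should hold three entries (NY twice, NJ once) while distinct_addresses holds two. Neither function promises a particular element order, so sort the array afterwards if order matters.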

Hope this helps!
