Spark UDF 对数组进行操作

Question

Spark UDF 对数组进行操作

Geo*_*ler 2 apache-spark apache-spark-sql

我有一个像这样的火花数据框：

+-------------+------------------------------------------+
|a            |destination                               |
+-------------+------------------------------------------+
|[a,Alice,1]  |[[b,Bob,0], [e,Esther,0], [h,Fraudster,1]]|
|[e,Esther,0] |[[f,Fanny,0], [d,David,0]]                |
|[c,Charlie,0]|[[b,Bob,0]]                               |
|[b,Bob,0]    |[[c,Charlie,0]]                           |
|[f,Fanny,0]  |[[c,Charlie,0], [h,Fraudster,1]]          |
|[d,David,0]  |[[a,Alice,1], [e,Esther,0]]               |
+-------------+------------------------------------------+

Run Code Online (Sandbox Code Playgroud)

其架构为

|-- destination: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- var_only_0_and_1: integer (nullable = false)

Run Code Online (Sandbox Code Playgroud)

如何构造一个对列进行操作的 UDF destination，即由 Spark 的 UDF 创建的包装数组collect_list来计算变量的平均值var_only_0_and_1？

Answer 1

Cho*_*ops 5

只要你获得正确的 UDF 方法签名，你就可以直接对数组进行操作（这在过去给我带来了很大的打击）。数组列对于 UDF 来说是可见的，作为 Seq，而 Struct 则作为 Row 可见，因此您需要如下所示的内容：

def test (in:Seq[Row]): String = {
  // return a named field from the second struct in the array
  in(2).getAs[String]("var_only_0_and_1")
}

var udftest = udf(test _)

Run Code Online (Sandbox Code Playgroud)

我已经在与您类似的数据上对此进行了测试。我猜想可以迭代 Seq[Row] 的字段来实现你想要的。

老实说，我完全不确定这样做的类型安全性，并且我相信按照@ayplam 的说法，爆炸是更好的方法。内置函数通常比开发人员提供的任何 UDF 都快，因为 Spark 无法优化 UDF。

归档时间：	8 年，1 月前
查看次数：	2938 次
最近记录：	3 年，3 月前