Split a Spark column of Array[String] into String columns

Jon*_*han 4 arrays string split apache-spark

Suppose I have a DataFrame with a column of type Array[String]:

scala> y.show
+---+----------+
|uid|event_comb|
+---+----------+
|  c|  [xx, zz]|
|  b|  [xx, xx]|
|  b|  [xx, yy]|
|  b|  [xx, zz]|
|  b|  [xx, yy]|
|  b|  [xx, zz]|
|  b|  [yy, zz]|
|  a|  [xx, yy]|
+---+----------+

How can I split the "event_comb" column into two columns (for example "event1" and "event2")?

Sha*_*ala 5

If your column type is a list (array) or a Map, you can use the getItem function to get values from it:

getItem(Object key)

An expression that gets an item at position ordinal out of an array, or gets a value by key in a MapType.
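To make the two uses of getItem concrete, here is a minimal sketch that reads one value by position from an array column and one value by key from a map column. The `event_map` column, the `pick` helper, and the output column names are hypothetical and introduced only for illustration; a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object GetItemSketch {
  // Hypothetical helper: getItem with an Int indexes an array (0-based),
  // getItem with a key looks the value up in a map column.
  def pick(df: DataFrame): DataFrame =
    df.select(
      col("uid"),
      col("event_comb").getItem(0).as("first_event"),
      col("event_map").getItem("second").as("second_event")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for a demo
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("getItem-sketch")
      .getOrCreate()
    import spark.implicits._

    // event_map is an illustrative column that is not part of the question's data
    val df = Seq(("c", Seq("xx", "zz"), Map("first" -> "xx", "second" -> "zz")))
      .toDF("uid", "event_comb", "event_map")

    pick(df).show(false)
    spark.stop()
  }
}
```

The same method covers both column types; only the argument (an integer position or a map key) changes.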

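// In spark-shell the implicits are already in scope;
// in a standalone application, run `import spark.implicits._` first.
val data = Seq(
  ("c", List("xx", "zz")),
  ("b", List("xx", "xx")),
  ("b", List("xx", "yy")),
  ("b", List("xx", "zz")),
  ("b", List("xx", "yy")),
  ("b", List("xx", "zz")),
  ("b", List("yy", "zz")),
  ("a", List("xx", "yy"))
).toDF("uid", "event_comb")

// getItem(i) extracts the element at 0-based position i of the array column
data.withColumn("event1", $"event_comb".getItem(0))
  .withColumn("event2", $"event_comb".getItem(1))
  .show(false)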

Output:

+---+----------+------+------+
|uid|event_comb|event1|event2|
+---+----------+------+------+
|c  |[xx, zz]  |xx    |zz    |
|b  |[xx, xx]  |xx    |xx    |
|b  |[xx, yy]  |xx    |yy    |
|b  |[xx, zz]  |xx    |zz    |
|b  |[xx, yy]  |xx    |yy    |
|b  |[xx, zz]  |xx    |zz    |
|b  |[yy, zz]  |yy    |zz    |
|a  |[xx, yy]  |xx    |yy    |
+---+----------+------+------+
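If the array can hold more than two elements, the same getItem approach can be applied in a loop instead of chaining withColumn calls by hand. The following is only a sketch under stated assumptions: the `splitArray` helper and its parameters are hypothetical, the number of elements `n` is assumed to be known in advance, and the original array column is dropped from the result (unlike the answer above, which keeps it).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SplitArrayColumn {
  // Hypothetical helper: expands the first `n` elements of `arrayCol`
  // into columns named prefix1 .. prefixN and drops the array column.
  def splitArray(df: DataFrame, arrayCol: String, n: Int, prefix: String): DataFrame = {
    // One getItem expression per position, aliased with a 1-based suffix
    val itemCols = (0 until n).map(i => col(arrayCol).getItem(i).as(s"$prefix${i + 1}"))
    // Every column except the array column is kept unchanged
    val kept = df.columns.filter(_ != arrayCol).map(col).toSeq
    df.select((kept ++ itemCols): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("split-array-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      ("c", List("xx", "zz")),
      ("b", List("xx", "yy"))
    ).toDF("uid", "event_comb")

    // Result has the columns uid, event1, event2
    splitArray(data, "event_comb", 2, "event").show(false)
    spark.stop()
  }
}
```

One caveat: as far as I know, asking for a position beyond the end of an array yields null when ANSI mode is off, while with `spark.sql.ansi.enabled` set to true it may raise an error instead, so `n` should not exceed the shortest array unless that behavior is acceptable.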