AJD*_*JDF 2 xml hadoop hive scala apache-spark
I have the following DataFrame, in which some columns contain arrays. (We are using Spark 1.6.)
+--------------------+--------------+------------------+--------------+--------------------+-------------+
| UserName| col1 | col2 |col3 |col4 |col5 |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|foo |[Main, Indi...|[1777203, 1777203]| [GBP, GBP]| [CR, CR]| [143, 143]|
+--------------------+--------------+------------------+--------------+--------------------+-------------+
I expect the following result:
+--------------------+--------------+------------------+--------------+--------------------+-------------+
| UserName| explod | explod2 |explod3 |explod4 |explod5 |
+--------------------+--------------+------------------+--------------+--------------------+-------------+
|NNNNNNNNNNNNNNNNN...| Main |1777203 | GBP | CR | 143 |
|NNNNNNNNNNNNNNNNN...|Individual |1777203 | GBP | CR | 143 |
----------------------------------------------------------------------------------------------------------
I tried LATERAL VIEW:
sqlContext.sql("""SELECT `UserName`, explod, explod2, explod3, explod4, explod5 FROM sourceDF
LATERAL VIEW explode(`col1`) sourceDF AS explod
LATERAL VIEW explode(`col2`) explod AS explod2
LATERAL VIEW explode(`col3`) explod2 AS explod3
LATERAL VIEW explode(`col4`) explod3 AS explod4
LATERAL VIEW explode(`col5`) explod4 AS explod5""")
But I get a Cartesian product with many duplicates. I tried the same thing by exploding every column with the withColumn method, but I still get many duplicates:
.withColumn("col1", explode($"col1"))...
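Of course I could call distinct on the final DataFrame, but that is not an elegant solution. Is there any way to explode the columns without getting all these duplicates?
Thanks!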
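The duplicates appear because each explode multiplies the rows produced so far by the length of the next array, so five two-element arrays give 2^5 = 32 rows instead of 2. A minimal sketch with plain Scala collections (no Spark involved; the names `cartesian` and `zipped` are only for illustration) shows the difference between that and pairing the elements by position:

```scala
// The five parallel arrays from the single row in the question.
val col1 = Seq("Main", "Individual")
val col2 = Seq(1777203, 1777203)
val col3 = Seq("GBP", "GBP")
val col4 = Seq("CR", "CR")
val col5 = Seq(143, 143)

// Chained explodes behave like nested loops: every element of each
// array is combined with every element of all the other arrays.
val cartesian = for {
  a <- col1; b <- col2; c <- col3; d <- col4; e <- col5
} yield (a, b, c, d, e)

// Zipping pairs the elements by position instead.
val zipped = col1.indices.map(i => (col1(i), col2(i), col3(i), col4(i), col5(i)))

println(cartesian.size)          // 32
println(cartesian.distinct.size) // 2
println(zipped.size)             // 2
```

Note that distinct only collapses to two rows here because col2 to col5 repeat the same value. With differing values it would keep mismatched combinations, so it is not a general substitute for zipping.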
If you are using Spark 2.4.0 or later, arrays_zip makes the task easier:
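// Both imports are already in scope in spark-shell; elsewhere they assume a SparkSession named spark.
import org.apache.spark.sql.functions.{arrays_zip, explode}
import spark.implicits._

val df = Seq(
  ("foo",
   Seq("Main", "Individual"),
   Seq(1777203, 1777203),
   Seq("GBP", "GBP"),
   Seq("CR", "CR"),
   Seq(143, 143)))
  .toDF("UserName", "col1", "col2", "col3", "col4", "col5")

// Zip the arrays into one array of structs, explode it once
// (the exploded column is named "col" by default), then flatten the struct.
df.select($"UserName",
    explode(arrays_zip($"col1", $"col2", $"col3", $"col4", $"col5")))
  .select($"UserName", $"col.*")
  .show()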
Output:
+--------+----------+-------+----+----+----+
|UserName| col1| col2|col3|col4|col5|
+--------+----------+-------+----+----+----+
| foo| Main|1777203| GBP| CR| 143|
| foo|Individual|1777203| GBP| CR| 143|
+--------+----------+-------+----+----+----+
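Since the question mentions Spark 1.6, where arrays_zip is not available, one possible workaround is to zip the arrays in a UDF and explode the result once. The following is only a sketch and has not been verified against 1.6. It assumes the element types of the sample DataFrame above (String, Int, String, String, Int) and that all five arrays in a row are non-null and equally long. The names `ZipExplode`, `zip5`, `explodeTogether` and the temporary column `z` are made up for illustration:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, udf}

object ZipExplode {
  // Pair the five arrays by position.
  // Assumption: all arrays are non-null and have the same length.
  def zip5(a: Seq[String], b: Seq[Int], c: Seq[String],
           d: Seq[String], e: Seq[Int]): Seq[(String, Int, String, String, Int)] =
    a.indices.map(i => (a(i), b(i), c(i), d(i), e(i)))

  // The tuple result becomes an array of structs with fields _1 to _5.
  private val zip5Udf = udf(zip5 _)

  // Explode the zipped array once, then pull the struct fields out
  // under the column names used in the question.
  def explodeTogether(df: DataFrame): DataFrame =
    df.withColumn("z",
        explode(zip5Udf(col("col1"), col("col2"), col("col3"), col("col4"), col("col5"))))
      .select(
        col("UserName"),
        col("z._1").as("explod"),
        col("z._2").as("explod2"),
        col("z._3").as("explod3"),
        col("z._4").as("explod4"),
        col("z._5").as("explod5"))
}
```

Calling `ZipExplode.explodeTogether(sourceDF)` should then give one row per array position. If the real columns hold other element types, the parameter types of `zip5` would have to be changed to match.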