How to explode a StructType into rows, rather than columns, from a JSON DataFrame in Spark

Question

tri*_*cky 3 scala apache-spark apache-spark-sql

I read a nested JSON file that has this schema:

 root
 |-- company: struct (nullable = true)
 |    |-- 0: string (nullable = true)
 |    |-- 1: string (nullable = true)
 |    |-- 10: string (nullable = true)
 |    |-- 100: string (nullable = true)
 |    |-- 101: string (nullable = true)
 |    |-- 102: string (nullable = true)
 |    |-- 103: string (nullable = true)
 |    |-- 104: string (nullable = true)
 |    |-- 105: string (nullable = true)
 |    |-- 106: string (nullable = true)
 |    |-- 107: string (nullable = true)
 |    |-- 108: string (nullable = true)
 |    |-- 109: string (nullable = true)

When I try:

df.select(col("company.*"))

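This gives me every field of the struct "company" as a column, but I want them as rows, with the id in one column and the string in another. What I currently get: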

  0        1         10       100      101        102 
"hey"   "yooyo"    "yuyu"    "hey"   "yooyo"    "yuyu"

What I would like to get instead is something like:

id    name
0     "hey"
1     "yooyo"
10    "yuyu"
100   "hey"
101   "yooyo"
102   "yuyu"

Thanks in advance for your help,

Tricky

Answer 1

Rap*_*oth 7

Try using union:

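import org.apache.spark.sql.functions.{col, lit}

// Flatten the struct so that each field becomes a top-level column
val dfExpl = df.select("company.*")

// Build one (id, name) DataFrame per field, then stack them with union
dfExpl.columns
  .map(name => dfExpl.select(lit(name).as("id"), col(name).as("name")))
  .reduce(_ union _)
  .show()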
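The union approach above builds one projection per struct field, so the query plan grows with the number of fields. As a hedged alternative that is not part of the original answer, Spark SQL's built-in stack generator can unpivot all fields in a single select. The sketch below assumes Spark 2.2 or later, a local SparkSession, made-up sample values, and field names that contain no quotes or backticks; the object name StackUnpivotSketch and the helper unpivot are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object StackUnpivotSketch {
  // Hypothetical helper: turn every field of the "company" struct into an (id, name) row.
  def unpivot(df: DataFrame): DataFrame = {
    val dfExpl = df.select("company.*")
    val cols = dfExpl.columns
    // Builds an expression such as: stack(3, '0', `0`, '1', `1`, '10', `10`) as (id, name)
    // Assumes field names contain no single quotes or backticks.
    val pairs = cols.map(c => s"'$c', `$c`").mkString(", ")
    dfExpl.selectExpr(s"stack(${cols.length}, $pairs) as (id, name)")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("stack-unpivot-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample record shaped like the schema in the question
    val json = Seq("""{"company": {"0": "hey", "1": "yooyo", "10": "yuyu"}}""").toDS()
    val df = spark.read.json(json)

    unpivot(df).show()
    spark.stop()
  }
}
```

stack requires the values in each output column to share a type, which holds here because every field of the struct is a string.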

Or use array/explode:

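import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val dfExpl = df.select("company.*")

// One (id, value) struct per field of the original "company" struct
val selectExpr = dfExpl
  .columns
  .map(name =>
    struct(
      lit(name).as("id"),
      col(name).as("value")
    ).as("col")
  )

// Pack the structs into an array and explode it into one row per field;
// explode names its output column "col" by default, so "col.*" unpacks it
dfExpl
  .select(
    explode(array(selectExpr: _*))
  )
  .select("col.*")
  .show()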
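To try the two approaches from this answer end to end, here is a minimal self-contained sketch. The sample JSON, the local SparkSession, the object name ExplodeStructDemo and its helper methods are assumptions made for illustration; it also assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String]. The output columns are named id and name to match the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

object ExplodeStructDemo {
  // Union approach: one two-column DataFrame per field, stacked together.
  def viaUnion(df: DataFrame): DataFrame = {
    val dfExpl = df.select("company.*")
    dfExpl.columns
      .map(name => dfExpl.select(lit(name).as("id"), col(name).as("name")))
      .reduce(_ union _)
  }

  // array/explode approach: pack (id, name) structs into an array, then explode it.
  def viaExplode(df: DataFrame): DataFrame = {
    val dfExpl = df.select("company.*")
    val pairs = dfExpl.columns.map(name =>
      struct(lit(name).as("id"), col(name).as("name")))
    dfExpl
      .select(explode(array(pairs: _*)).as("pair"))
      .select("pair.*")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explode-struct-demo")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample record shaped like the schema in the question
    val json = Seq("""{"company": {"0": "hey", "1": "yooyo", "10": "yuyu"}}""").toDS()
    val df = spark.read.json(json)

    viaUnion(df).show()
    viaExplode(df).show()
    spark.stop()
  }
}
```

Both helpers should return the same set of (id, name) rows. Row order is not guaranteed, so add an explicit sort on id if the order matters.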

Is there a PySpark version of this answer? (2 upvotes)

Asked:	7 years, 11 months ago
Viewed:	1245 times
Last active:	7 years, 11 months ago