I have a dataframe with the following structure:
|-- data: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- keyNote: struct (nullable = true)
| | |-- key: string (nullable = true)
| | |-- note: string (nullable = true)
| |-- details: map (nullable = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
How can I flatten the struct and create a new dataframe:
|-- id: long (nullable = true)
|-- keyNote: struct (nullable = true)
| |-- key: string (nullable = true)
| |-- note: …

I want to perform something similar to pandas.io.json.json_normalize on a PySpark dataframe. Is there an equivalent function in Spark?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
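Spark has no direct json_normalize equivalent; the usual approach is to select nested struct fields by dotted path (e.g. df.select("data.id", "data.keyNote.key")). As a sketch of the naming scheme only, here is the per-record logic in plain Python, using a hypothetical flatten helper on dicts rather than the Spark API (note that in Spark you would normally leave map columns such as details alone, whereas a plain dict cannot tell structs and maps apart):

```python
def flatten(record, prefix=""):
    # Recursively hoist nested dict fields up to dotted top-level keys.
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, prefix=f"{name}."))
        else:
            out[name] = value
    return out

row = {"id": 1, "keyNote": {"key": "k1", "note": "n1"}, "details": {"a": "b"}}
flat = flatten(row)
# e.g. the nested keyNote.key field becomes a top-level "keyNote.key" entry
```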
I am using the approach given here to flatten a DataFrame in Spark SQL. Here is my code:
package com.acme.etl.xml

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Column, SparkSession}

object RuntimeError {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FlattenSchema").getOrCreate()
    val rowTag = "idocData"
    val dataFrameReader = spark.read.option("rowTag", rowTag)
    val xmlUri = "bad_011_1.xml"
    val df = dataFrameReader.format("xml").load(xmlUri)
    val schema: StructType = df.schema
    val columns: Array[Column] = flattenSchema(schema)
    val df2 = df.select(columns: _*)
  }

  def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
    schema.fields.flatMap(f => {
      val colName: String = if (prefix …

The error I get in PySpark:
pyspark.sql.utils.AnalysisException: "cannot resolve '`result_set`.`dates`.`trackers`['token']' due to data type mismatch: argument 2 requires integral type, however, ''token'' is of string type.;;\n'Project [result_parameters#517, result_set#518, <lambda>(result_set#518.dates.trackers[token]) AS result_set.dates.trackers.token#705]\n+- Relation[result_parameters#517,result_set#518] json\n"
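The AnalysisException states the problem directly: trackers is an array, so trackers['token'] is an array index lookup whose argument must be an integer; a string field name cannot index an array. The field has to be projected across the array's struct elements instead (in Spark, e.g. a dotted path over the array, or a higher-order function such as transform). The distinction, sketched with plain Python lists and dicts (the sample data is made up):

```python
# A struct maps field name -> value, like a dict; an array is positional, like a list.
date = {"date": "2020-01-01",
        "trackers": [{"token": "t1"}, {"token": "t2"}]}

# date["trackers"]["token"] would raise TypeError (list indices must be integers),
# which is the same mismatch the AnalysisException reports.
# Projecting the field across the elements is what the query needs to express:
tokens = [tracker["token"] for tracker in date["trackers"]]
```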
The data structure:
-- result_set: struct (nullable = true)
| |-- currency: string (nullable = true)
| |-- dates: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- date: string (nullable = true)
| | | |-- trackers: array (nullable = true)
| | | | …

I would like to collect the names of all fields in a nested schema. The data was imported from a JSON file.
The schema looks like this:
root
|-- column_a: string (nullable = true)
|-- column_b: string (nullable = true)
|-- column_c: struct (nullable = true)
| |-- nested_a: struct (nullable = true)
| | |-- double_nested_a: string (nullable = true)
| | |-- double_nested_b: string (nullable = true)
| | |-- double_nested_c: string (nullable = true)
| |-- nested_b: string (nullable = true)
|-- column_d: string (nullable = true)
If I use df.schema.fields or df.schema.names, it only prints the names of the top-level columns, not the nested ones.
The desired output is a Python list containing all the column names, e.g.:
['column_a', 'column_b', 'column_c.nested_a.double_nested_a', …

I have a Spark job with a DataFrame containing the following values:
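For the field-name question: recursing through the schema and emitting a dotted path at every leaf produces exactly such a list. With pyspark one would walk df.schema.fields and recurse whenever a field's dataType is a StructType; here the same traversal is sketched over a plain-dict stand-in for the schema (the dict layout is invented for illustration, it is not the pyspark API):

```python
def leaf_names(schema, prefix=""):
    # Leaves are type strings; nested structs are dicts we recurse into.
    names = []
    for name, dtype in schema.items():
        path = f"{prefix}{name}"
        if isinstance(dtype, dict):
            names.extend(leaf_names(dtype, prefix=f"{path}."))
        else:
            names.append(path)
    return names

schema = {
    "column_a": "string",
    "column_b": "string",
    "column_c": {
        "nested_a": {
            "double_nested_a": "string",
            "double_nested_b": "string",
            "double_nested_c": "string",
        },
        "nested_b": "string",
    },
    "column_d": "string",
}
names = leaf_names(schema)
```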
{
    "id": "abchchd",
    "test_id": "ndsbsb",
    "props": {
        "type": {
            "isMale": true,
            "id": "dd",
            "mcc": 1234,
            "name": "Adam"
        }
    }
}
{
    "id": "abc",
    "test_id": "asf",
    "props": {
        "type2": {
            "isMale": true,
            "id": "dd",
            "mcc": 12134,
            "name": "Perth"
        }
    }
}
I want to flatten it elegantly (the keys and types inside are not known in advance), so that props remains a struct but everything inside it is flattened, regardless of the level of nesting.
The desired output is:
{
    "id": "abchchd",
    "test_id": "ndsbsb",
    "props": {
        "type.isMale": true,
        "type.id": "dd",
        "type.mcc": 1234,
        "type.name": "Adam"
    }
}
{
    "id": "abc",
    "test_id": "asf",
    "props": {
        "type2.isMale": true,
        "type2.id": "dd",
        "type2.mcc": 12134,
        "type2.name": "Perth" …
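In record terms, the transformation wanted here collapses everything below props to one level of dotted keys while leaving id, test_id, and the props container itself untouched. A plain-Python sketch of that per-record reshaping (dict-based, not the DataFrame API; in Spark the equivalent would rebuild props as a struct of the flattened leaf columns):

```python
def flatten_props(props):
    # Iteratively collapse arbitrarily deep nesting into dotted keys.
    flat, stack = {}, [("", props)]
    while stack:
        prefix, value = stack.pop()
        for key, inner in value.items():
            path = f"{prefix}{key}"
            if isinstance(inner, dict):
                stack.append((f"{path}.", inner))
            else:
                flat[path] = inner
    return flat

record = {
    "id": "abchchd",
    "test_id": "ndsbsb",
    "props": {"type": {"isMale": True, "id": "dd", "mcc": 1234, "name": "Adam"}},
}
record["props"] = flatten_props(record["props"])
```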