ADA*_*H K 5 java scala dataset dataframe apache-spark
I need to create a DataFrame from an existing DataFrame, and I also need to change the schema while doing so.
I have a DataFrame like this:
+-----+--------+----------+
|Id   |Position|playerName|
+-----+--------+----------+
|10125|Forward |Messi     |
|10126|Forward |Ronaldo   |
|10127|Midfield|Xavi      |
|10128|Midfield|Neymar    |
+-----+--------+----------+
I created it using the case class given below:
case class caseClass (
  Id: Int = 0,  // an Int field needs a numeric default; "" does not compile
  Position: String = "",
  playerName: String = ""
)
Now I need to put playerName and Position under a struct type.
That is, I need to create another DataFrame with the schema:
root
|-- Id: integer (nullable = true)
|-- playerDetails: struct (nullable = true)
| |-- playerName: string (nullable = true)
| |-- Position: string (nullable = true)
I tried to create the new DataFrame with the code below, following https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803
My schema is
List(
StructField("Id", IntegerType, true),
StructField("Position",StringType, true),
StructField("playerName", StringType,true)
)
I tried the following code
spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  myschema
)
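But I could not get it to work.
I saw a similar question, "Change schema of existing dataframe", but I could not understand the solution.
Is there a way to express the StructType directly in the case class, so that I do not have to build my own schema to create struct-typed values?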
You can use the "struct" function:
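// imports (assumes an active SparkSession named `spark`, as in spark-shell)
import org.apache.spark.sql.functions.struct
import spark.implicits._

// data
val playersDF = Seq(
  (10125, "Forward", "Messi"),
  (10126, "Forward", "Ronaldo"),
  (10127, "Midfield", "Xavi"),
  (10128, "Midfield", "Neymar")
).toDF("Id", "Position", "playerName")

// action: nest playerName and Position inside a single struct column
val playersStructuredDF = playersDF.select($"Id", struct("playerName", "Position").as("playerDetails"))

// display
playersStructuredDF.printSchema()
playersStructuredDF.show(false)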
Output:
root
|-- Id: integer (nullable = false)
|-- playerDetails: struct (nullable = false)
| |-- playerName: string (nullable = true)
| |-- Position: string (nullable = true)
+-----+------------------+
|Id |playerDetails |
+-----+------------------+
|10125|[Messi, Forward] |
|10126|[Ronaldo, Forward]|
|10127|[Xavi, Midfield] |
|10128|[Neymar, Midfield]|
+-----+------------------+
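On the last point of the question (expressing the struct directly in a case class), one possible approach is to nest one case class inside another and let Spark derive the schema from it. The sketch below is only an illustration under assumptions: the names PlayerDetails, Player, NestedCaseClassSketch and toPlayers are hypothetical and do not appear in the question, and it runs against a local SparkSession.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.struct

// Hypothetical case classes: the nested PlayerDetails field is what
// becomes a struct column when Spark derives the schema.
case class PlayerDetails(playerName: String, Position: String)
case class Player(Id: Int, playerDetails: PlayerDetails)

object NestedCaseClassSketch {

  // Nest the two flat columns into a struct, then view each row as a Player.
  // Fields are matched to the case classes by name.
  def toPlayers(spark: SparkSession, flat: DataFrame): Dataset[Player] = {
    import spark.implicits._
    flat
      .select($"Id", struct($"playerName", $"Position").as("playerDetails"))
      .as[Player]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run for illustration
      .appName("nested-case-class-sketch")
      .getOrCreate()
    import spark.implicits._

    // Option 1: build the data from nested case class instances;
    // no hand-written StructType is needed.
    val direct = Seq(
      Player(10125, PlayerDetails("Messi", "Forward")),
      Player(10126, PlayerDetails("Ronaldo", "Forward"))
    ).toDS()
    direct.printSchema()

    // Option 2: convert an existing flat DataFrame.
    val flat = Seq(
      (10127, "Midfield", "Xavi"),
      (10128, "Midfield", "Neymar")
    ).toDF("Id", "Position", "playerName")
    toPlayers(spark, flat).show(false)

    spark.stop()
  }
}
```

The case classes are declared at the top level rather than inside a method, because Spark's implicit encoders generally cannot be derived for classes defined locally in a method body.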