jbr*_*own 4 dataframe apache-spark apache-spark-sql
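I have a set of nested case classes. I have a job that generates a Dataset from these case classes and writes the output to Parquet.
I was rather annoyed to find that I have to load this data manually and convert it back into the case classes in order to use it in subsequent jobs. Anyway, that is what I am trying to do now.
My case classes look like this: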
case class Person(userId: String, tech: Option[Tech])
case class Tech(browsers: Seq[Browser], platforms: Seq[Platform])
case class Browser(family: String, version: Int)
So I am loading my Parquet data. I can get the tech data as a Row:
val df = sqlContext.load("part-r-00716.gz.parquet")
val x = df.head
val tech = x.getStruct(x.fieldIndex("tech"))
But now I cannot find how to actually iterate over the browsers. If I try val browsers = tech.getStruct(tech.fieldIndex("browsers")) I get an exception:
java.lang.ClassCastException: scala.collection.mutable.WrappedArray$ofRef cannot be cast to org.apache.spark.sql.Row
How can I iterate over my nested browser data using Spark 1.5.2?
Update
In reality, my case classes contain optional values, so Browser is actually:
case class Browser(family: String,
                   major: Option[String] = None,
                   minor: Option[String] = None,
                   patch: Option[String] = None,
                   language: String,
                   timesSeen: Long = 1,
                   firstSeenAt: Long,
                   lastSeenAt: Long)
I also have a similar Os:
case class Os(family: String,
              major: Option[String] = None,
              minor: Option[String] = None,
              patch: Option[String] = None,
              patchMinor: Option[String],
              override val timesSeen: Long = 1,
              override val firstSeenAt: Long,
              override val lastSeenAt: Long)
So Tech is really:
case class Technographic(browsers: Seq[Browser],
                         devices: Seq[Device],
                         oss: Seq[Os])
Now, given that some values are optional, I need a solution that lets me correctly reconstruct my case classes. The current solution does not support None values, so for example given the input data:
Tech(browsers=Seq(
    Browser(family=Some("IE"), major=Some(7), language=Some("en"), timesSeen=3),
    Browser(family=None, major=None, language=Some("en-us"), timesSeen=1),
    Browser(family=Some("Firefox"), major=None, language=None, timesSeen=1)
  )
)
I need it to load the data as follows:
family=IE, major=7, language=en, timesSeen=3,
family=None, major=None, language=en-us, timesSeen=1,
family=Firefox, major=None, language=None, timesSeen=1
Because the current solution does not support None values, it actually ends up with an arbitrary number of values per list item, i.e.:
browsers.family = ["IE", "Firefox"]
browsers.major = [7]
browsers.language = ["en", "en-us"]
timesSeen = [3, 1, 1]
As you can see, there is no way to convert the final data (as returned by Spark) back into the case classes that generated it.
How do I solve this madness?
Ber*_*ium 10
Some examples
import org.apache.spark.sql.functions.{col, explode}

// Select two columns
df.select("userId", "tech.browsers").show()
// Select the nested values only
df.select("tech.browsers").show(truncate = false)
+-------------------------+
|browsers |
+-------------------------+
|[[Firefox,4], [Chrome,2]]|
|[[Firefox,4], [Chrome,2]]|
|[[IE,25]] |
|[] |
|null |
+-------------------------+
// Extract the family (nested value)
// This way you can iterate over the persons, and get their browsers
// Family values are nested
df.select("tech.browsers.family").show()
+-----------------+
| family|
+-----------------+
|[Firefox, Chrome]|
|[Firefox, Chrome]|
| [IE]|
| []|
| null|
+-----------------+
// Normalize the family: one row for each family
// Then you can iterate over all families
// Family values are un-nested; rows whose array is empty or null produce no output rows
df.select(explode(col("tech.browsers.family")).alias("family")).show()
+-------+
| family|
+-------+
|Firefox|
| Chrome|
|Firefox|
| Chrome|
| IE|
+-------+
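
If the goal is to keep the fields of one browser together, rather than getting one array per field, a possible variation is to explode the array of structs itself and then select the struct's fields. This is only a minimal sketch: it assumes the userId and tech.browsers columns shown above, with family and version as the struct fields, and explodeBrowsers is a hypothetical helper name, not part of Spark.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

object ExplodeBrowsers {
  // Hypothetical helper: one output row per (person, browser) pair.
  // Exploding the struct array (not a single field of it) keeps the
  // family and version of the same browser on the same row; a field
  // that is null inside a struct stays null in the output.
  def explodeBrowsers(df: DataFrame): DataFrame =
    df.select(col("userId"), explode(col("tech.browsers")).alias("browser"))
      .select(col("userId"), col("browser.family"), col("browser.version"))
}
```

Persons whose browsers array is empty or null do not appear in the result, just as in the explode example over family above.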
Building on the explode example over tech.browsers.family:
val families = df.select(explode(col("tech.browsers.family")))
.map(r => r.getString(0)).distinct().collect().toList
println(families)
This gives the unique list of browsers in a "plain" local Scala list:
List(IE, Firefox, Chrome)
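
For the update in the question (rebuilding case classes whose fields may be missing), one possible approach is to read the array column as a Seq[Row] instead of calling getStruct on it, and to wrap each nullable field in an Option via isNullAt. This is a minimal sketch and has not been verified against Spark 1.5.2: BrowserRec, optString and toBrowsers are hypothetical names, and the field names family, major, language and timesSeen are assumed from the sample input in the question.

```scala
import org.apache.spark.sql.Row

object RowToBrowsers {
  // Hypothetical, trimmed-down target type; every nullable field is an Option.
  case class BrowserRec(family: Option[String],
                        major: Option[String],
                        language: Option[String],
                        timesSeen: Long)

  // Read a string field by name, mapping a null value to None.
  def optString(row: Row, name: String): Option[String] = {
    val i = row.fieldIndex(name)
    if (row.isNullAt(i)) None else Some(row.getString(i))
  }

  // `tech` is the struct Row from the question (x.getStruct(...)); it may be
  // null because the field is an Option[Tech]. The browsers column is an
  // array, so it is read with getSeq[Row] rather than getStruct; a null
  // struct or a null array becomes an empty Seq.
  def toBrowsers(tech: Row): Seq[BrowserRec] = {
    if (tech == null) Seq.empty
    else {
      val i = tech.fieldIndex("browsers")
      if (tech.isNullAt(i)) Seq.empty
      else tech.getSeq[Row](i).map { b =>
        BrowserRec(
          family = optString(b, "family"),
          major = optString(b, "major"),
          language = optString(b, "language"),
          timesSeen = b.getLong(b.fieldIndex("timesSeen")))
      }
    }
  }
}
```

With the tech row from the question, RowToBrowsers.toBrowsers(tech) would return one BrowserRec per array element, in the original order, with missing fields as None. Looking fields up by name relies on the rows carrying a schema, which is the case for rows read from a DataFrame.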