如何将spark SchemaRDD转换为我的case类的RDD?

sam*_*est 10 sql apache-spark parquet

在spark文档中,很清楚如何从RDD您自己的案例类创建镶木地板文件; (来自文档)

val people: RDD[Person] = ??? // An RDD of case class objects, from the previous example.

// The RDD is implicitly converted to a SchemaRDD by createSchemaRDD, allowing it to be stored using Parquet.
people.saveAsParquetFile("people.parquet")
Run Code Online (Sandbox Code Playgroud)

但不清楚如何转换回来,我们真的想要一个readParquetFile我们可以做的方法:

val people: RDD[Person] = sc.readParquestFile[Person](path)
Run Code Online (Sandbox Code Playgroud)

其中定义了case类的那些值是由方法读取的那些值.

sam*_*est 6

我提出的最佳解决方案需要最少量的复制和粘贴新类如下(我仍然希望看到另一种解决方案)

首先,您必须定义案例类和(部分)可重用的工厂方法

import org.apache.spark.sql.catalyst.expressions

case class MyClass(fooBar: Long, fred: Long)

// Here you want to auto gen these functions using macros or something
object Factories extends java.io.Serializable {
  def longLong[T](fac: (Long, Long) => T)(row: expressions.Row): T = 
    fac(row(0).asInstanceOf[Long], row(1).asInstanceOf[Long])
}
Run Code Online (Sandbox Code Playgroud)

一些锅炉板已经可以使用了

import scala.reflect.runtime.universe._
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
Run Code Online (Sandbox Code Playgroud)

魔法

import scala.reflect.ClassTag
import org.apache.spark.sql.SchemaRDD

def camelToUnderscores(name: String) = 
  "[A-Z]".r.replaceAllIn(name, "_" + _.group(0).toLowerCase())

def getCaseMethods[T: TypeTag]: List[String] = typeOf[T].members.sorted.collect {
  case m: MethodSymbol if m.isCaseAccessor => m
}.toList.map(_.toString)

def caseClassToSQLCols[T: TypeTag]: List[String] = 
  getCaseMethods[T].map(_.split(" ")(1)).map(camelToUnderscores)

def schemaRDDToRDD[T: TypeTag: ClassTag](schemaRDD: SchemaRDD, fac: expressions.Row => T) = {
  val tmpName = "tmpTableName" // Maybe should use a random string
  schemaRDD.registerAsTable(tmpName)
  sqlContext.sql("SELECT " + caseClassToSQLCols[T].mkString(", ") + " FROM " + tmpName)
  .map(fac)
}
Run Code Online (Sandbox Code Playgroud)

使用示例

val parquetFile = sqlContext.parquetFile(path)

val normalRDD: RDD[MyClass] = 
  schemaRDDToRDD[MyClass](parquetFile, Factories.longLong[MyClass](MyClass.apply))
Run Code Online (Sandbox Code Playgroud)

也可以看看:

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Convert-SchemaRDD-back-to-RDD-td9071.html

虽然我没有通过遵循JIRA链接找到任何示例或文档.


mar*_*ios 5

一种简单的方法是提供自己的转换器(Row) => CaseClass.这是一个更多的手册,但如果你知道你在读什么,它应该是非常简单的.

这是一个例子:

import org.apache.spark.sql.SchemaRDD

case class User(data: String, name: String, id: Long)

def sparkSqlToUser(r: Row): Option[User] = {
    r match {
      case Row(time: String, name: String, id: Long) => Some(User(time,name, id))
      case _ => None
    }
}

val parquetData: SchemaRDD = sqlContext.parquetFile("hdfs://localhost/user/data.parquet")

val caseClassRdd: org.apache.spark.rdd.RDD[User] = parquetData.flatMap(sparkSqlToUser)
Run Code Online (Sandbox Code Playgroud)