Woo*_*per 6 hive scala apache-spark parquet
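I wrote a DataFrame out as a Parquet file, and I would like to read that file with Hive using the metadata stored in the Parquet output.
Output of the Parquet write: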
_common_metadata part-r-00000-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet part-r-00002-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet _SUCCESS
_metadata part-r-00001-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet part-r-00003-0def6ca1-0f54-4c53-b402-662944aa0be9.gz.parquet
Hive table:
CREATE TABLE testhive
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'/home/gz_files/result';
FAILED: SemanticException [Error 10043]: Either list of columns or a custom serializer should be specified
How can I infer the metadata from the Parquet file?
If I open _common_metadata, I see the following content:
PAR1LHroot
%TSN%
%TS%
%Etype%
)org.apache.spark.sql.parquet.row.metadata?{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}
Or, how can I parse the metadata file?
Jam*_*bin 11
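Here is the solution I use to get the metadata from the Parquet files in order to create a Hive table.
First, start a spark-shell (or compile everything into a JAR and run it with spark-submit, but the shell is so much easier to use).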
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.DataFrame
// parquetFile is deprecated since Spark 1.4; sqlContext.read.parquet(...) is the newer equivalent
val df = sqlContext.parquetFile("/path/to/_common_metadata")
def creatingTableDDL(tableName: String, df: DataFrame): String = {
  val cols = df.dtypes
  var ddl1 = "CREATE EXTERNAL TABLE " + tableName + " ("
  // looks at the datatypes and column names and puts them into a string
  val colCreate = (for (c <- cols) yield (c._1 + " " + c._2.replace("Type", ""))).mkString(", ")
  ddl1 += colCreate + ") STORED AS PARQUET LOCATION '/wherever/you/store/the/data/'"
  ddl1
}
// the function takes two arguments: the table name and the DataFrame
val test_tableDDL = creatingTableDDL("test_table", df)
This gives you the data types Hive will use for each column, as they are stored in Parquet. For example: CREATE EXTERNAL TABLE test_table (COL1 Decimal(38,10), COL2 String, COL3 Timestamp) STORED AS PARQUET LOCATION '/path/to/parquet/files'
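As for parsing the metadata directly: the JSON string stored under org.apache.spark.sql.parquet.row.metadata in the question's _common_metadata dump is Spark's own schema serialization. The following is only a sketch, assuming you have already extracted that JSON string (reading it out of the Parquet footer is not shown here) and that spark-sql is on the classpath. The names schemaJson, schema and columns are chosen for illustration.

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Schema JSON copied verbatim from the _common_metadata dump in the question.
val schemaJson =
  """{"type":"struct","fields":[{"name":"TSN","type":"string","nullable":true,"metadata":{}},{"name":"TS","type":"string","nullable":true,"metadata":{}},{"name":"Etype","type":"string","nullable":true,"metadata":{}}]}"""

// DataType.fromJson returns a DataType; a top-level row schema is a StructType.
val schema = DataType.fromJson(schemaJson).asInstanceOf[StructType]

// Build "name type" pairs that could be placed between the parentheses of a DDL statement.
val columns = schema.fields.map(f => f.name + " " + f.dataType.simpleString)
println(columns.mkString(", "))
// prints: TSN string, TS string, Etype string
```

In practice, loading the file through Spark as shown above is simpler, because Spark reads the footer for you.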
小智 8
I would like to expand on James Tobin's answer. There is a StructField class that provides Hive's data types without any string replacement.
// Tested on Spark 1.6.0.
import org.apache.spark.sql.DataFrame
def dataFrameToDDL(dataFrame: DataFrame, tableName: String): String = {
  val columns = dataFrame.schema.map { field =>
    " " + field.name + " " + field.dataType.simpleString.toUpperCase
  }
  s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}
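This solves the IntegerType problem, where df.dtypes reports IntegerType and the string replacement turns it into Integer rather than INT.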
scala> val dataFrame = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("x", "y")
dataFrame: org.apache.spark.sql.DataFrame = [x: int, y: string]
scala> print(dataFrameToDDL(dataFrame, "t"))
CREATE TABLE t (
x INT,
y STRING
)
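This should work with any DataFrame, not just Parquet. (For example, I am using it with a JDBC DataFrame.)
As an added bonus, if your target DDL supports nullable columns, you can extend the function by checking StructField.nullable.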
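A minimal sketch of that extension, assuming the target engine accepts a NOT NULL column constraint (not every Hive version does). The helper name schemaToDDL is hypothetical. It takes the StructType directly so that it can be tried without a running SparkContext; pass dataFrame.schema to use it with a DataFrame.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical helper: same idea as dataFrameToDDL, but appends NOT NULL
// to every column whose StructField.nullable is false.
def schemaToDDL(schema: StructType, tableName: String): String = {
  val columns = schema.fields.map { field =>
    val nullability = if (field.nullable) "" else " NOT NULL"
    "  " + field.name + " " + field.dataType.simpleString.toUpperCase + nullability
  }
  s"CREATE TABLE $tableName (\n${columns.mkString(",\n")}\n)"
}

// Example schema built by hand; with a DataFrame you would use dataFrame.schema.
val exampleSchema = StructType(Seq(
  StructField("x", IntegerType, nullable = false),
  StructField("y", StringType, nullable = true)
))
println(schemaToDDL(exampleSchema, "t"))
```

Nullable columns are emitted without a suffix, so the output for a fully nullable schema matches what dataFrameToDDL would produce, apart from the indentation.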