vis*_*akh 9 hive scala apache-spark shark-sql
我正在尝试计算Spark SQL上的Jaccard索引.我的表格中Hive包含以下数据:
hive> select * from test_1;
1 ["rock","pop"]
2 ["metal","rock"]
Run Code Online (Sandbox Code Playgroud)
表DDL:
create table test_1
(id int, val array<string>);
Run Code Online (Sandbox Code Playgroud)
我使用的是UDF来自Brickhouse.从spark-shell,我可以执行以下命令来创建临时函数.
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.hql("CREATE TEMPORARY FUNCTION jaccard_similarity AS 'brickhouse.udf.sketch.SetSimilarityUDF'")
Run Code Online (Sandbox Code Playgroud)
我还将.jar文件添加到CLASSPATHfor spark-shell(in compute-classpath.sh)中.
当我列出函数时,我能够看到我创建的新函数.
hiveContext.hql("show functions").collect().foreach(println)
Run Code Online (Sandbox Code Playgroud)
接下来,我使用该jaccard_similarity函数计算val数组的Jaccard Index .
hiveContext.hql("select jaccard_similarity(a.val, b.val) from test_1 a join test_1 b")
Run Code Online (Sandbox Code Playgroud)
我收到以下错误:
14/07/31 15:39:56 INFO ParseDriver: Parsing command: select jaccard_similarity(a.val, b.val) from test_1 a join test_1 b
14/07/31 15:39:56 INFO ParseDriver: Parse Completed
14/07/31 15:39:56 INFO Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/07/31 15:39:56 INFO Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
14/07/31 15:39:56 INFO HiveMetaStore: 0: get_table : db=default tbl=test_1
14/07/31 15:39:56 INFO audit: ugi=username ip=unknown-ip-addr cmd=get_table : db=default tbl=test_1
14/07/31 15:39:56 INFO HiveMetaStore: 0: get_table : db=default tbl=test_1
14/07/31 15:39:56 INFO audit: ugi=username ip=unknown-ip-addr cmd=get_table : db=default tbl=test_1
scala.MatchError: ArrayType(StringType) (of class org.apache.spark.sql.catalyst.types.ArrayType)
at org.apache.spark.sql.hive.HiveInspectors$typeInfoConversions.toTypeInfo(hiveUdfs.scala:382)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$2.apply(hiveUdfs.scala:52)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$2.apply(hiveUdfs.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:52)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:131)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:129)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:52)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:66)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:65)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5.applyOrElse(Analyzer.scala:129)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5.applyOrElse(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:62)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:60)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:60)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:52)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:52)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:313)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:313)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:248)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:247)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:316)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:316)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:319)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:319)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.simpleString(HiveContext.scala:315)
at org.apache.spark.sql.SchemaRDDLike$class.toString(SchemaRDDLike.scala:67)
at org.apache.spark.sql.SchemaRDD.toString(SchemaRDD.scala:100)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .<init>(<console>:10)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:601)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:608)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:611)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:936)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Run Code Online (Sandbox Code Playgroud)
我查看了SparkGitHub 的源代码.在datatypes.scala中,有以下代码:
protected lazy val arrayType: Parser[DataType] =
"ArrayType" ~> "(" ~> dataType ~ "," ~ boolVal <~ ")" ^^ {
case tpe ~ _ ~ containsNull => ArrayType(tpe, containsNull)
Run Code Online (Sandbox Code Playgroud)
我找不到任何array不受Spark SQL支持的引用.如果有人可以分享关于如何使其工作的任何指示,那将是很棒的.
此外,该功能从Hiveshell 完美运行.
更新(8月5日):
我只是在Github的Master分支上构建Spark.错误消息有更多信息(比如scala.MatchError: ArrayType(StringType,false)代替scala.MatchError: ArrayType(StringType))
scala> hiveContext.hql("select jaccard_similarity(a.val, b.val) from test_1 a join test_1 b")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
14/08/05 13:54:53 INFO ParseDriver: Parsing command: select jaccard_similarity(a.val, b.val) from test_1 a join test_1 b
14/08/05 13:54:53 INFO ParseDriver: Parse Completed
14/08/05 13:54:53 INFO HiveMetaStore: 0: get_table : db=default tbl=test_1
14/08/05 13:54:53 INFO audit: ugi=chandrv1 ip=unknown-ip-addr cmd=get_table : db=default tbl=test_1
14/08/05 13:54:53 INFO HiveMetaStore: 0: get_table : db=default tbl=test_1
14/08/05 13:54:53 INFO audit: ugi=chandrv1 ip=unknown-ip-addr cmd=get_table : db=default tbl=test_1
scala.MatchError: ArrayType(StringType,false) (of class org.apache.spark.sql.catalyst.types.ArrayType)
at org.apache.spark.sql.hive.HiveInspectors$typeInfoConversions.toTypeInfo(HiveInspectors.scala:216)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$2.apply(hiveUdfs.scala:52)
at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$2.apply(hiveUdfs.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:52)
at org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:253)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:41)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:41)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:41)
at org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:253)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:131)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:129)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:52)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:66)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:65)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:70)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:41)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5.applyOrElse(Analyzer.scala:129)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$5.applyOrElse(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:127)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$.apply(Analyzer.scala:126)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:61)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$2.apply(RuleExecutor.scala:59)
at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
at scala.collection.immutable.List.foldLeft(List.scala:84)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:59)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1.apply(RuleExecutor.scala:51)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.apply(RuleExecutor.scala:51)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed$lzycompute(SQLContext.scala:394)
at org.apache.spark.sql.SQLContext$QueryExecution.analyzed(SQLContext.scala:394)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan$lzycompute(HiveContext.scala:350)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.optimizedPlan(HiveContext.scala:349)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:399)
at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:397)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:403)
at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:403)
at org.apache.spark.sql.hive.HiveContext$QueryExecution.simpleString(HiveContext.scala:419)
at org.apache.spark.sql.SchemaRDDLike$class.toString(SchemaRDDLike.scala:67)
at org.apache.spark.sql.SchemaRDD.toString(SchemaRDD.scala:103)
at scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
at .<init>(<console>:10)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1061)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:859)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:771)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:616)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:624)
at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:629)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:954)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:902)
at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:997)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:314)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Run Code Online (Sandbox Code Playgroud)
我还看了HiveInspectors.scala(第212行typeInfoConversions).似乎ArrayType那里没有定义.
抱歉,我在 StackOverflow 上的声誉不够高,无法发表评论。希望这个“答案”能很好地传达给您。不管怎样,我正在使用 HiveContext 玩 SparkSQL,并注意到 ArrayType 的行为非常相似。虽然它不能解决你的问题,但它可以解释原因
事实证明,仅当使用 Spark“内部”表结构时,HiveContext (Spark 1.1.0) 才支持 ArrayType。每当您尝试访问 Spark“外部”Hive 表(即托管在 Metastore 上)时,您可能会遇到类似的问题,不支持 ArrayType。
这是一个简单的说明
// ************
// ArrayType is supported when playing with SparkSQL temp tables...
// ************
val sqlContext = org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sqlContext.jsonFile("/tmp/test.json")
rdd.printSchema
/*
root
|-- id: integer (nullable = true)
|-- names: array (nullable = true)
| |-- element: string (containsNull = false)
*/
sqlContext.registerRDDAsTable(rdd,"test")
val out = sqlContext.sql("SELECT names FROM test")
// ************
// ...But fail on Hive statements
// ************
sqlContext.sql("CREATE TABLE mytable AS SELECT names FROM test")
/*
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 16, vagrant): scala.MatchError: ArrayType(StringType,true) (of class org.apache.spark.sql.catalyst.types.ArrayType)
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
.../...
*/
Run Code Online (Sandbox Code Playgroud)
我仍然不知道它失败的确切原因/位置,但 HiveContext 不(完全)支持 ArrayType。不管怎样,我怀疑你在这里描述的问题与你的 jaccard UDF 函数有关。
或者,使用这样一个(有效的)丑陋的黑客:)
sqlContext.sql("CREATE TABLE mytable AS SELECT split(concat_ws('#',names),'#') FROM test")
Run Code Online (Sandbox Code Playgroud)
| 归档时间: |
|
| 查看次数: |
5196 次 |
| 最近记录: |