Posts by Apr*_*ari

Does spark.sql.autoBroadcastJoinThreshold apply to joins written with the Dataset join operator?

I would like to know whether the spark.sql.autoBroadcastJoinThreshold property can be used to broadcast the smaller table to all worker nodes (while performing the join) even when the join is written with the Dataset API rather than Spark SQL.

If my large table is 250 GB and the smaller one is 20 GB, do I need to set this configuration, spark.sql.autoBroadcastJoinThreshold = 21 GB (perhaps?), so that the entire smaller table/Dataset is sent to all worker nodes?

Example:
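A minimal sketch of what setting the threshold looks like (assuming Spark 2.x; the table names and sizes here are hypothetical, and note the property is specified in bytes, not gigabytes):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session for illustration; the threshold value is in bytes.
val spark = SparkSession.builder()
  .appName("broadcast-join-demo")
  .config("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024) // 10 MB
  .getOrCreate()

import spark.implicits._

// Dataset-API join, not a SQL string; Catalyst still consults the
// threshold when picking the join strategy for both APIs.
val big   = spark.range(1000000).toDF("id")
val small = spark.range(100).toDF("id")

// explain() shows whether a BroadcastHashJoin was chosen.
big.join(small, "id").explain()
```

Inspecting the physical plan via `explain()` is the usual way to confirm whether the broadcast actually happened for a given pair of tables.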

apache-spark apache-spark-sql

10 votes
2 answers
20k views

Registering a Hive custom UDF with Spark (Spark SQL) 2.0.0

I am working on Spark 2.0.0, and my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in my SQL context, inside one of the queries. On my cluster I use it in Hive queries as a temporary function by simply defining CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is very straightforward.

I tried to register it with the sparkSession as shown below, but got an error:

sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")

Error:

CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
    at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) …
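As the exception hints, the default session catalog in Spark 2.0 cannot build Hive function classes; a common workaround is to build the session with Hive support so that CREATE TEMPORARY FUNCTION is handled by the Hive catalog. A minimal sketch (hypothetical app name; assumes the jar containing the UDF is already on the driver and executor classpath):

```scala
import org.apache.spark.sql.SparkSession

// enableHiveSupport() swaps in the Hive-aware catalog, which knows how
// to instantiate Hive UDF classes like UDFNumberRows.
val spark = SparkSession.builder()
  .appName("hive-udf-demo")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'")
```

Without Hive support, the session falls back to the in-memory catalog, which only accepts functions registered through `spark.udf.register(...)`, as the error message suggests.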

apache-spark apache-spark-sql udf

4 votes
2 answers
8254 views

Tag statistics

apache-spark ×2

apache-spark-sql ×2

udf ×1