I would like to know whether the spark.sql.autoBroadcastJoinThreshold property can still be used to broadcast the smaller table to all worker nodes when the join is written with the Dataset API rather than Spark SQL.
If my bigger table is 250 GB and the smaller one is 20 GB, do I need to set spark.sql.autoBroadcastJoinThreshold to roughly 21 GB so that the whole smaller table/Dataset is shipped to all worker nodes?
Example:
Dataset API join
val result = rawBigger.as("b").join(
  broadcast(smaller).as("s"),
  rawBigger(FieldNames.CAMPAIGN_ID) === smaller(FieldNames.CAMPAIGN_ID),
  "left_outer"
)
SQL
select *
from rawBigger_table b, smaller_table s
where b.campaign_id = s.campaign_id;
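As a point of reference, here is a minimal sketch of the two ways to request a broadcast join: the session-wide threshold and the explicit `broadcast()` hint. The table and column names are hypothetical, and the session setup is an assumption, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder()
  .appName("broadcast-join-sketch")
  .getOrCreate()

// Option 1: raise the auto-broadcast threshold. The value is in bytes
// (default ~10 MB); -1 disables auto-broadcast entirely. Broadcasting
// ~20 GB is rarely practical, since each executor must hold a full copy.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 21L * 1024 * 1024 * 1024)

// Option 2: hint the planner explicitly on the Dataset API side.
// Hypothetical table names standing in for the 250 GB and 20 GB tables.
val bigger  = spark.table("rawBigger_table")
val smaller = spark.table("smaller_table")

val joined = bigger.as("b").join(
  broadcast(smaller).as("s"),
  Seq("campaign_id"),
  "left_outer"
)
```

Note that `broadcast()` is a planner hint, so it applies to Dataset API joins regardless of whether the query went through the SQL parser, and independently of the threshold setting.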
I am working on Spark 2.0.0, and my requirement is to use the 'com.facebook.hive.udf.UDFNumberRows' function in my SQL context, inside one of my queries. On my cluster, with Hive queries, I use it as a temporary function simply by defining: CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows', which is quite simple.
I tried registering it through the sparkSession as below, but got an error:
sparkSession.sql("""CREATE TEMPORARY FUNCTION myFunc AS 'com.facebook.hive.udf.UDFNumberRows'""")
Error:
CREATE TEMPORARY FUNCTION rowsequence AS 'com.facebook.hive.udf.UDFNumberRows'
16/11/01 20:46:17 ERROR ApplicationMaster: User class threw exception: java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
java.lang.UnsupportedOperationException: Use sqlContext.udf.register(...) instead.
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeFunctionBuilder(SessionCatalog.scala:751)
at org.apache.spark.sql.execution.command.CreateFunctionCommand.run(functions.scala:61)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:65)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582) …
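The exception itself points at the alternative: `sqlContext.udf.register(...)`. A minimal sketch of that route is below; the UDF body is a hypothetical stand-in (it is not UDFNumberRows), and `enableHiveSupport()` is an assumption about what the Hive DDL path would additionally require.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: for the Hive-style CREATE TEMPORARY FUNCTION DDL to be
// accepted at all, the session would need Hive support enabled.
val spark = SparkSession.builder()
  .appName("udf-register-sketch")
  .enableHiveSupport()
  .getOrCreate()

// The path the exception recommends: register a native Spark UDF directly.
// This is a hypothetical function, not a wrapper around UDFNumberRows.
spark.udf.register("myFunc", (x: Long) => x + 1)

// Once registered, the function is usable from SQL in this session.
spark.sql("SELECT myFunc(id) AS n FROM range(5)").show()
```

The trade-off is that `udf.register` takes a Scala function, so an existing Hive UDF class cannot be dropped in as-is; its logic has to be re-expressed as a native Spark UDF, or the Hive function has to be registered through a Hive-enabled session instead.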