Posted by bil*_*St3

Azure PySpark: registering UDFs from a jar fails with UDFRegistration

I'm having trouble registering some UDFs that live in a Java jar file. I've tried several approaches, but they all return:

Failed to execute user defined function (UDFRegistration$$Lambda$6068/1550981127: (double, double) => double)

First I tried this approach:

from pyspark.context import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import *
conf = SparkConf()
conf.set('spark.driver.extraClassPath', 'dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar')
conf.set('spark.jars', 'dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar')

# The context must exist before the session that wraps it
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
#spark.sparkContext.addPyFile("dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar")
udfs = [
    ('jaro_winkler_sim', 'JaroWinklerSimilarity',DoubleType()),
    ('jaccard_sim', 'JaccardSimilarity',DoubleType()),
    ('cosine_distance', 'CosineDistance',DoubleType()),
    ('Dmetaphone', 'DoubleMetaphone',StringType()),
    ('QgramTokeniser', 'QgramTokeniser',StringType())
]
for name, java_class, return_type in udfs:
    spark.udf.registerJavaFunction(name, 'uk.gov.moj.dash.linkage.' + java_class, return_type)

linker = Splink(settings, spark, df_l=df_l, df_r=df_r)
df_e = linker.get_scored_comparisons()
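A UDF that registers without complaint but fails at call time can mean the expected class (or one of its dependencies) never made it onto the classpath. One sanity check, sketched here under the assumption that the jar has been copied to a local path (the path in the comment is hypothetical), is to list the jar's entries and confirm the expected class files are present, since a jar is just a zip archive:

```python
import zipfile

def find_udf_classes(jar_path, class_names, package='uk/gov/moj/dash/linkage'):
    """Return the subset of class_names whose .class files exist in the jar."""
    with zipfile.ZipFile(jar_path) as jar:
        entries = set(jar.namelist())
    return [c for c in class_names if f'{package}/{c}.class' in entries]

# Hypothetical local copy of the jar (on Databricks, dbfs:/... is FUSE-mounted at /dbfs/...):
# present = find_udf_classes('/dbfs/FileStore/jars/scala_udf_similarity.jar',
#                            ['JaroWinklerSimilarity', 'JaccardSimilarity'])
```

If a class listed in the `udfs` tuples is missing from the result, the failure is a packaging problem rather than a registration one.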

Next I tried moving the jar and extraClassPath into the cluster configuration:

spark.jars dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
spark.driver.extraClassPath dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
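One thing worth checking here (an assumption about the Databricks runtime, not something the error message confirms): `spark.driver.extraClassPath` is read by the JVM from the local filesystem, and DBFS is FUSE-mounted under `/dbfs`, so the classpath entry may need the local-mount form rather than the `dbfs:` URI:

```
spark.jars dbfs:/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
spark.driver.extraClassPath /dbfs/FileStore/jars/4b129434_12cd_4f2a_ab27_baaefe904857-scala_udf_similarity_0_0_7-35e3b.jar
```

Installing the jar as a cluster library instead would sidestep the path question entirely, since Databricks then places it on the classpath of every node.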

Then I registered them in my script as follows:

from pyspark.context import …

azure apache-spark pyspark databricks azure-databricks

Score: 5 · 1 answer · 544 views