"expected zero arguments for construction of ClassDict (for numpy.dtype)" when calling a UDF that returns FloatType()

mom*_*ind 3 python dataframe pyspark pyspark-sql

I believe this is related to: Spark Error: Expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)

I have a dataframe:

id col_1 col_2
1 [1,2] [1,3]
2 [2,1] [3,4]

I want to create another column containing the cosine distance between col_1 and col_2.

from scipy.spatial.distance import cosine

def cosine_distance(a,b):
    try:
        return cosine(a, b)
    except Exception as e:
        return 0.0 # in case division by zero

I defined a UDF:

cosine_distance_udf = udf(cosine_distance, FloatType())

Finally:

new_df = df.withColumn('cosine_distance', cosine_distance_udf('col_1', 'col_2'))

I get the error: PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)

What am I doing wrong?

cs9*_*s95 5

The cause of the error becomes clear when you check the return type of cosine:

type(cosine([1, 2], [1, 3]))
# numpy.float64

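Spark's serializer does not know how to handle NumPy types such as numpy.float64, hence the PickleException. However, np.float64 is a subclass of float: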

import numpy as np

issubclass(np.float64, float)
# True
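
As a quick illustration of why the subclass relationship alone is not enough, the sketch below (the variable names are illustrative, not from the original post) shows that a np.float64 value passes an isinstance check against float but is still a NumPy object, and that both float(...) and .item() turn it into a built-in float:

```python
import numpy as np

value = np.float64(0.25)

# np.float64 subclasses float, but the object's exact type is still a NumPy type
is_exact_float = type(value) is float        # False
is_float_instance = isinstance(value, float)  # True

# Either conversion yields a plain built-in float
as_builtin = float(value)
as_item = value.item()

print(is_exact_float, is_float_instance)
print(type(as_builtin), type(as_item))
```

Either conversion should be safe to return from a UDF declared with FloatType().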

So a small change to your function is enough: convert the result to a built-in float.

def cosine_distance(a, b):
    try:
        return float(cosine(a, b))  # alternatively: cosine(a, b).item()
    except Exception as e:
        return 0.0 # in case division by zero

With that change, this works:

df.withColumn('cosine_distance', cosine_distance_udf('col_1', 'col_2')).show()

+------+------+---------------+
| col_1| col_2|cosine_distance|
+------+------+---------------+
|[1, 2]|[3, 4]|     0.01613009|
|[2, 1]|[3, 4]|     0.10557281|
+------+------+---------------+
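
Another way to sidestep NumPy return types entirely is to compute the distance with the standard library only. The function below is a minimal sketch, not part of the original answer: cosine_distance_plain is a hypothetical name, and returning 0.0 for a zero-length vector is an assumption that mirrors the fallback in the code above.

```python
import math

def cosine_distance_plain(a, b):
    """Cosine distance (1 - cosine similarity) returned as a built-in float."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: fall back to 0.0 when either vector has zero norm,
    # mirroring the "division by zero" fallback in the original function
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance_plain([1, 2], [3, 4]))
print(cosine_distance_plain([0, 0], [3, 4]))
```

Because the result is always a plain Python float, it could be wrapped with udf(..., FloatType()) in the same way as the SciPy-based version.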