Posts by Hey*_*Man

databricks-connect fails to load a module in a UDF

I am trying to load PyNaCl into a pyspark UDF running on Windows.

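from nacl import bindings as c
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType

def verify_signature(msg, keys):
    c.crypto_sign_ed25519ph_update(...)
    ...

# Wrap the check in a UDF; public_keys is a driver-side variable captured by the lambda
verify_signature_udf = udf(lambda x: verify_signature(x, public_keys), BooleanType())

# Add a boolean column holding the verification result for each row
data_signed = data.withColumn(
    "is_signature_valid", verify_signature_udf("state_values")
)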

PyNaCl is installed locally (where databricks-connect is used), but as I understand it, it is not installed on the executors. So I get this:

File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle.py", line 679, in subimport
    __import__(name)
ModuleNotFoundError: No module named 'nacl'
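
One way to confirm the suspicion that the module is missing on the executors rather than on the driver is to probe for it from inside a UDF. The following is a minimal sketch, not a definitive diagnostic: `module_available` and `probe_executor` are hypothetical helper names, the probe only reports on whichever executor happens to run the single-row job, and it assumes both helpers are defined in the driver script so that they are pickled along with the UDF.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be found by the current interpreter."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A dotted name whose parent package is missing raises instead of
        # returning None, so treat that as "not available" too.
        return False


def probe_executor(spark, name: str) -> bool:
    """Run module_available(name) inside a UDF, i.e. on an executor.

    Hypothetical helper: `spark` is an existing SparkSession.
    """
    # Imported lazily so the pure-Python check above works without pyspark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import BooleanType

    probe = udf(lambda _: module_available(name), BooleanType())
    row = spark.range(1).select(probe("id").alias("available")).first()
    return row["available"]
```

Calling `module_available("nacl")` on the driver and `probe_executor(spark, "nacl")` against the cluster would then show whether the two environments disagree.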

As described in Python Packaging, I tried to load it like this:

import os
from pyspark.sql import SparkSession

# Point the executors at the Python interpreter inside the unpacked archive
os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder.config(
    "spark.archives",
    "pyspark_venv.tar.gz#environment").getOrCreate()

No change, same message. If I instead extract just the nacl package from the tar.gz, store it as a zip file, and load that, it is loaded, but I now get this error: …
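
The zip step described above can be sketched with the standard library alone. This is an illustration under assumptions, not the poster's actual procedure: `zip_package` and `native_members` are hypothetical helper names, and the layout (package folder at the archive root) is what Python's zip import mechanism expects. The error above is truncated, so the second helper only covers one thing worth ruling out: Python cannot import compiled extension modules from inside a zip archive, and PyNaCl ships a compiled `_sodium` binding, which in any case would have to match the executors' platform rather than the Windows client.

```python
import zipfile
from pathlib import Path


def zip_package(package_dir, zip_path):
    """Zip a package directory so the package folder sits at the archive root."""
    package_dir = Path(package_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(package_dir.rglob("*")):
            if path.is_file():
                # Store entries as "<package>/<relative path>" with forward slashes.
                zf.write(path, path.relative_to(package_dir.parent).as_posix())
    return zip_path


def native_members(zip_path):
    """List compiled binaries in the archive; these cannot be imported from a zip."""
    suffixes = (".so", ".pyd", ".dll", ".dylib")
    with zipfile.ZipFile(zip_path) as zf:
        return [name for name in zf.namelist() if name.endswith(suffixes)]
```

If `native_members("nacl.zip")` returns a non-empty list, the pure-Python part of the package may import from the zip while the compiled binding does not.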

python pyspark databricks pynacl databricks-connect

Score: 5 · Answers: 0 · Views: 315
