小编ayp*_*lam的帖子

如何将pyspark UDF导入主类

我有两个文件。functions.py具有一个函数,并从该函数创建pyspark udf。main.py尝试导入udf。但是,main.py似乎无法访问中的功能functions.py

functions.py:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def do_something(x):
    return x + 'hello'

sample_udf = udf(lambda x: do_something(x), StringType())
Run Code Online (Sandbox Code Playgroud)

main.py:

from functions import sample_udf, do_something
df = spark.read.load(file)
df.withColumn("sample",sample_udf(col("text")))
Run Code Online (Sandbox Code Playgroud)

这会导致错误:

17/10/03 19:35:29 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, ip-10-223-181-5.ec2.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/worker.py", line 164, in main
    func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
  File "/usr/lib/spark/python/pyspark/worker.py", line 93, …
Run Code Online (Sandbox Code Playgroud)

python user-defined-functions apache-spark pyspark

3
推荐指数
2
解决办法
5734
查看次数