我有两个文件。functions.py具有一个函数,并从该函数创建pyspark udf。main.py尝试导入udf。但是,main.py似乎无法访问中的功能functions.py。
functions.py:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def do_something(x):
return x + 'hello'
sample_udf = udf(lambda x: do_something(x), StringType())
Run Code Online (Sandbox Code Playgroud)
main.py:
from functions import sample_udf, do_something
df = spark.read.load(file)
df.withColumn("sample",sample_udf(col("text")))
Run Code Online (Sandbox Code Playgroud)
这会导致错误:
17/10/03 19:35:29 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, ip-10-223-181-5.ec2.internal, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/lib/spark/python/pyspark/worker.py", line 164, in main
func, profiler, deserializer, serializer = read_udfs(pickleSer, infile)
File "/usr/lib/spark/python/pyspark/worker.py", line 93, …Run Code Online (Sandbox Code Playgroud)