I am trying to apply a pandas_udf that takes two arguments, but I get an error. First I tried it with a single argument, and that works fine:
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .config('spark.cores.max', 100) \
    .getOrCreate()

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))
The data looks like this:
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1| 2.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
+---+----+
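For context (an illustrative aside, not part of the original post): with a GROUPED_AGG pandas UDF, each column argument arrives as a pandas Series holding that group's values. A minimal pandas-only sketch of what the per-group "v" Series look like:

import pandas as pd

# Same data as the Spark DataFrame above, grouped by "id"
pdf = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})
for gid, group in pdf.groupby("id"):
    print(gid, group["v"].tolist())
# 1 [1.0, 2.0]
# 2 [3.0, 5.0, 10.0]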
My pandas_udf function is:
@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def count_udf(v):
    cond = v <= 3
    res = v[cond].count()
    return res

df.groupby("id").agg(count_udf(df['v'])).show()
The result is:
+---+------------+
| …
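The two-argument attempt and the resulting error are truncated above. As a general illustration only (not the code from the original question), a GROUPED_AGG pandas UDF can declare two parameters, each received as a pandas Series for the group; the second column w below is hypothetical:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def count_udf2(v, w):
    # v and w are pandas Series of equal length for each group
    cond = (v <= 3) & (w > 0)
    return float(v[cond].count())

df2 = df.withColumn("w", df["v"] * 2)  # hypothetical second column for the sketch
df2.groupby("id").agg(count_udf2(df2["v"], df2["w"])).show()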