TypeError: 'Column' object is not callable using WithColumn

Question


Bru*_*nal 6 apache-spark apache-spark-sql pyspark spark-dataframe

I want to append a new column to the DataFrame "df", computed by the function get_distance:

def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 as column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result

df = df.withColumn(
    "distance",
    lit(get_distance(df["column1"], df["column2"]))
)


However, I get this:

TypeError: 'Column' object is not callable


I think this happens because x and y are Column objects, and I need to convert them to String to use them in the query. Am I right? If so, how can I do that?
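
To illustrate the suspicion above, here is a toy sketch in plain Python. `FakeColumn` is a hypothetical stand-in written only for this illustration, not PySpark's actual `Column` class; it mimics just one idea, namely that operators applied to a column build a larger expression instead of producing a value.

```python
class FakeColumn:
    """Toy stand-in for a lazy column expression (hypothetical, not PySpark's Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __radd__(self, other):
        # "some string" + column ends up here and yields another expression object.
        return FakeColumn(f"concat({other!r}, {self.expr})")

    def __add__(self, other):
        # column + "some string" also yields an expression object.
        rhs = other.expr if isinstance(other, FakeColumn) else repr(other)
        return FakeColumn(f"concat({self.expr}, {rhs})")


def build_query(x):
    # Same string concatenation pattern as in get_distance.
    return "select column3 from tab where column1 = '" + x + "'"


q_value = build_query("abc")                    # x is a plain Python string
q_column = build_query(FakeColumn("column1"))   # x is a column expression

print(type(q_value).__name__)
print(type(q_column).__name__)
```

In this model the same function returns a string for a plain value and an expression object for a column, so nothing that expects a finished SQL string can use the second result.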

Answer 1

AKs*_*AKs 8

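Spark needs to know that the function you are using is not an ordinary function but a UDF.

There are therefore two ways to use a UDF on DataFrames.

Method 1: Using the @udf decorator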

from pyspark.sql.functions import udf, lit

@udf
def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 as column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result

df = df.withColumn(
    "distance",
    lit(get_distance(df["column1"], df["column2"]))
)


Method 2: Registering the UDF with pyspark.sql.functions.udf

from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType

def get_distance(x, y):
    dfDistPerc = hiveContext.sql("select column3 as column3 \
                                  from tab \
                                  where column1 = '" + x + "' \
                                  and column2 = " + y + " \
                                  limit 1")

    result = dfDistPerc.select("column3").take(1)
    return result

calculate_distance_udf = udf(get_distance, IntegerType())

df = df.withColumn(
    "distance",
    lit(calculate_distance_udf(df["column1"], df["column2"]))
)


Answer 2

小智 6

You cannot use a Python function directly on Column objects unless it is designed to operate on Column objects/expressions. You need a udf:
```
@udf
def get_distance(x, y):
    ...
```
However, you cannot use SQLContext inside a udf (or in a mapper in general).

Just join:

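from pyspark.sql.functions import first
# Keep one column3 value per (column1, column2) pair, then join it onto df.
tab = hiveContext.table("tab").groupBy("column1", "column2").agg(first("column3"))
df.join(tab, ["column1", "column2"])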
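
For intuition about what the groupBy, first and join combination computes, here is a minimal plain-Python model that does not use Spark at all. The helper names `first_per_key` and `inner_join` and all sample rows are made up for this sketch. It only mirrors the logic: build one lookup of the first column3 per (column1, column2) pair, then attach it to every matching row, instead of running one query per row.

```python
def first_per_key(rows):
    """Keep the first column3 seen for each (column1, column2) pair."""
    lookup = {}
    for column1, column2, column3 in rows:
        # setdefault leaves an existing entry untouched, so the first value wins.
        lookup.setdefault((column1, column2), column3)
    return lookup


def inner_join(df_rows, lookup):
    """Attach the looked-up value to each row; rows without a match are dropped."""
    return [
        (column1, column2, lookup[(column1, column2)])
        for column1, column2 in df_rows
        if (column1, column2) in lookup
    ]


# Made-up sample data standing in for the table "tab" and the DataFrame "df".
tab_rows = [("a", 1, 10), ("a", 1, 99), ("b", 2, 20)]
df_rows = [("a", 1), ("b", 2), ("c", 3)]

joined = inner_join(df_rows, first_per_key(tab_rows))
print(joined)  # [('a', 1, 10), ('b', 2, 20)]
```

Two differences from Spark are worth keeping in mind. The row ("c", 3) disappears because the default join type is an inner join; a left join would keep it with a null distance. Also, Spark's first gives no ordering guarantee unless the data is explicitly ordered, whereas this model simply takes the first row in list order.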

