What is the correct way of summing different DataFrame columns in a list in PySpark?

Geo*_*eRF 4 python apache-spark apache-spark-sql pyspark pyspark-sql

I want to sum different columns in a Spark DataFrame.

from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Why doesn't approach #2 work? I'm on Spark 2.2.

Sur*_*esh 8

Because:

# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here, you are using Python's built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
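
For Column objects, the built-in sum simply folds the list with the + operator (starting from 0), so the line above builds one column expression that adds the columns row by row. A minimal sketch of the equivalent explicit fold, using the same column names from the question:

from functools import reduce
from operator import add

# sum([c1, c2]) produces the Column expression (0 + c1) + c2;
# an explicit fold with + yields the same per-row total.
df = df.withColumn('sum1', reduce(add, [df[c] for c in ["`A.p1`", "`B.p1`"]]))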

# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))

Here, you are using PySpark's sum function, which takes a single column as input and aggregates it across all rows, but you are trying to apply it at the row level by passing it a Python list of columns. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
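
Since F.sum is an aggregate, its intended use is inside select or groupBy, where it collapses all rows of one column into a single value. A minimal sketch, reusing the column names from the question:

from pyspark.sql import functions as F

# Aggregates each column over all rows; returns a single row of totals,
# not a new per-row column.
df.select(F.sum("`A.p1`"), F.sum("`B.p1`")).show()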

# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))

Here, df.select() returns a DataFrame, and you are trying to apply sum to the DataFrame itself, which is not an iterable of numbers. In that case, I think you would have to iterate row by row and apply sum to each row.
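
If you did want to start from df.select, one sketch of that row-by-row iteration is to drop down to the RDD level (approach #1 stays in the DataFrame API and is preferable; this collects results to the driver, so it only suits small data):

# Each Row is iterable, so Python's built-in sum works per row.
row_sums = df.select(["`A.p1`", "`B.p1`"]).rdd.map(lambda row: sum(row)).collect()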