I want to sum several columns in a Spark DataFrame.
Code:
from pyspark.sql import functions as F
cols = ["A.p1","B.p1"]
df = spark.createDataFrame([[1,2],[4,89],[12,60]],schema=cols)
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Why doesn't approach #2 work? I'm on Spark 2.2.
Because:
# 1. Works
df = df.withColumn('sum1', sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using the Python built-in sum function, which takes an iterable as input, so it works. https://docs.python.org/2/library/functions.html#sum
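The built-in sum starts from 0 and folds the list with +, and pyspark's Column overloads +, so the call builds a single column expression rather than touching any data. A minimal sketch of what approach #1 expands to (same df as above):

# Built-in sum(list) is equivalent to folding + over the Columns:
# 0 + df["`A.p1`"] + df["`B.p1`"], which is itself a Column expression
df = df.withColumn('sum1', df["`A.p1`"] + df["`B.p1`"])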
# 2. Doesn't work
df = df.withColumn('sum1', F.sum([df[col] for col in ["`A.p1`","`B.p1`"]]))
Here you are using pyspark's sum function, which takes a column as input and aggregates it across rows, while you are trying to compute a row-level sum. http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.sum
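F.sum is an aggregate function: it collapses one column across all rows into a single value, and belongs inside select, groupBy().agg(), and similar calls. A minimal sketch of its intended use (same df as above):

# F.sum aggregates down a column, across rows:
df.select(F.sum("`A.p1`")).show()  # one row containing 1 + 4 + 12 = 17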
# 3. Doesn't work
df = df.withColumn('sum1', sum(df.select(["`A.p1`","`B.p1`"])))
Here, df.select() returns a DataFrame, and you are trying to sum over a DataFrame object. In that case, I think you would have to iterate over the rows and apply the sum to each of them.
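If you want a row-level sum that scales to many columns, a common alternative (not part of the original answer) is to fold + over the column list with functools.reduce, which does explicitly what the built-in sum did in approach #1. A minimal sketch:

from functools import reduce
from operator import add

# Fold + over the Columns to build one row-wise sum expression
df = df.withColumn('sum1', reduce(add, [df[c] for c in ["`A.p1`", "`B.p1`"]]))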