How to sum the values of a column in a PySpark DataFrame

Question

Lau*_*ren 2 sum dataframe apache-spark pyspark

I am working in PySpark and have a DataFrame with the following columns.

Q1 = spark.read.csv("Q1final.csv",header = True, inferSchema = True)
Q1.printSchema()

root
|-- index_date: integer (nullable = true)
|-- item_id: integer (nullable = true)
|-- item_COICOP_CLASSIFICATION: integer (nullable = true)
|-- item_desc: string (nullable = true)
|-- index_algorithm: integer (nullable = true)
|-- stratum_ind: integer (nullable = true)
|-- item_index: double (nullable = true)
|-- all_gm_index: double (nullable = true)
|-- gm_ra_index: double (nullable = true)
|-- coicop_weight: double (nullable = true)
|-- item_weight: double (nullable = true)
|-- cpih_coicop_weight: double (nullable = true)

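I need the sum of all values in the last column (cpih_coicop_weight) as a Double so that I can use it in other parts of my program. How can I do this? Thank you very much in advance!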

Answer 1

Ste*_*ven 9

Try this:

from pyspark.sql import functions as F
total = Q1.groupBy().agg(F.sum("cpih_coicop_weight")).collect()

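total then holds the result. Note that collect() returns a list containing a single Row, so the sum itself is total[0][0].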

Answer 2

Lou*_*ang 9

If you only want a double or int as the return value, you can use the following function:

from pyspark.sql import functions as F

def sum_col(df, col):
    return df.select(F.sum(col)).collect()[0][0]

Then

sum_col(Q1, 'cpih_coicop_weight')

will return the sum. I am new to PySpark, so I am not sure why the library does not provide such a simple method on column objects.

Answer 3

Ath*_*har 5

You can also try this.

total = Q1.agg(F.sum("cpih_coicop_weight")).collect()
