Sat*_*tya 4 python pyspark pyspark-sql
Say I have a dataframe like this:
name age city
abc 20 A
def 30 B
I want to append a summary row at the end of the dataframe, so the result looks like:
name age city
abc 20 A
def 30 B
All 50 All
The string 'All' is easy enough, but how do I get sum(df['age'])? I get "Column is not iterable":
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
#|-- name: string (nullable = true)
#|-- age: long (nullable = true)
#|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All',sum(data['age']),'All')], data.columns)) ## TypeError: Column is not iterable
# Also tried data['age'].sum() and got an error. Hard-coding [('All', 50, 'All')] works fine.
I work with Pandas dataframes a lot and am new to Spark, so my mental model of Spark dataframes may not be right yet.
Please suggest how to get the sum of a dataframe column in pyspark, and whether there is a better way to add/append a row to the end of a dataframe. Thanks.
swe*_*zel 13
Spark SQL has a dedicated module of column functions, pyspark.sql.functions.
So it works like this:
from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
res = data.unionAll(
    data.select([
        F.lit('All').alias('name'),   # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city')    # create a column named 'city' filled with 'All'
    ]))
res.show()
Which prints:
+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20| A|
| def| 30| B|
| All| 50| All|
+----+---+----+