Suppose the original data looks like this:
Competitor  Region  ProductA  ProductB
Comp1       A       £10       £15
Comp1       B       £11       £16
Comp1       C       £11       £15
Comp2       A       £9        £16
Comp2       B       £12       £14
Comp2       C       £14       £17
Comp3       A       £11       £16
Comp3       B       £10       £15
Comp3       C       £12       £15
(Reference: Python - splitting a dataframe into multiple dataframes based on column values and naming them with those values)
I would like to get a list of sub-dataframes split on a column value such as Region, for example:
df_A :
Competitor  Region  ProductA  ProductB
Comp1       A       £10       £15
Comp2       A       £9        £16
Comp3       A       £11       £16
In Python (pandas), I can do this:
for region, df_region in df.groupby('Region'):
    print(df_region)
Can I do the same kind of iteration if df is a PySpark DataFrame?
In PySpark, once I call df.groupBy("Region") I get a GroupedData object. I don't need any aggregation such as count or mean; I just need a list of sub-DataFrames, each containing rows with the same "Region" value. Is that possible?
Assuming the list of unique values in the grouping column is small enough to fit in the driver's memory, the following approach should work for you. Hope this helps!
import pyspark.sql.functions as F
import pandas as pd

# Sample data (spark is assumed to be an existing SparkSession)
df = pd.DataFrame({'region': ['aa', 'aa', 'aa', 'bb', 'bb', 'cc'],
                   'x2': [6, 5, 4, 3, 2, 1],
                   'x3': [1, 2, 3, 4, 5, 6]})
df = spark.createDataFrame(df)

# Get the unique values in the grouping column
groups = [x[0] for x in df.select("region").distinct().collect()]

# Create a filtered DataFrame for each group in a list comprehension
groups_list = [df.filter(F.col('region') == x) for x in groups]

# Show the results
[x.show() for x in groups_list]
Result:
+------+---+---+
|region| x2| x3|
+------+---+---+
| cc| 1| 6|
+------+---+---+
+------+---+---+
|region| x2| x3|
+------+---+---+
| bb| 3| 4|
| bb| 2| 5|
+------+---+---+
+------+---+---+
|region| x2| x3|
+------+---+---+
| aa| 6| 1|
| aa| 5| 2|
| aa| 4| 3|
+------+---+---+
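A small variation (my own sketch, not part of the original answer): since the referenced question also wanted the split dataframes named after the grouping value, you could keep the sub-DataFrames in a dict keyed by the region value, and cache the source DataFrame so each filter does not rescan the input:

# Sketch continuing from the snippet above (df, F and groups are already defined).
df.cache()  # keep the source in memory so every filter doesn't rescan it
groups_dict = {x: df.filter(F.col('region') == x) for x in groups}
groups_dict['aa'].show()

Note that each action on a sub-DataFrame still launches its own Spark job, so if the number of groups is large, writing the data out partitioned by the column (df.write.partitionBy('region')) is usually the more scalable route.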