sha*_*dzy 4 sql apache-spark-sql
So, suppose I have the following table:
Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue
For each name, I want to get the number of distinct colors along with their values. Something like this:
Name | Distinct | Values
--------------------------------------
John | 2 | Blue, Yellow
Greg | 2 | Red, Blue
Any ideas on how to do this?
collect_list gives you a list without removing duplicates. collect_set removes duplicates automatically, so:
select
Name,
count(distinct Color) as Distinct, -- not a very good name
collect_set(Color) as Values
from TblName
group by Name
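For anyone who prefers the DataFrame API to SQL, the same aggregation can be sketched in Scala roughly as follows. This is only an illustration, not part of the original answer: the object name `DistinctColors`, the helper `summarize`, and the alias `DistinctCount` are made up for the example, and it assumes Spark 2.x or later (for `SparkSession`). Because collect_set does not guarantee element order, the sketch wraps it in `sort_array` and joins the result with `concat_ws` to get a comma-separated string like the one in the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{collect_set, concat_ws, countDistinct, sort_array}

// Hypothetical wrapper object, used only for this example.
object DistinctColors {

  // Per name: the number of distinct colors and the distinct colors themselves.
  // Assumes the input has string columns "Name" and "Color".
  def summarize(df: DataFrame): DataFrame =
    df.groupBy("Name").agg(
      countDistinct("Color").as("DistinctCount"),
      // collect_set drops duplicates; sort_array makes the order deterministic;
      // concat_ws turns the array into a single comma-separated string.
      concat_ws(", ", sort_array(collect_set("Color"))).as("Values")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-colors")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the table in the question.
    val df = Seq(
      ("John", "Blue"),
      ("Greg", "Red"),
      ("John", "Yellow"),
      ("Greg", "Red"),
      ("Greg", "Blue")
    ).toDF("Name", "Color")

    summarize(df).show(false)
    spark.stop()
  }
}
```

With the sample data, each name should get a count of 2, and the values come out sorted alphabetically (so Greg's are "Blue, Red" rather than "Red, Blue"). Drop `sort_array` if the order does not matter, or drop `concat_ws` to keep the values as an array column.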
This function has been available since Spark 1.6.0:
/**
* Aggregate function: returns a set of objects with duplicate elements eliminated.
*
* For now this is an alias for the collect_set Hive UDAF.
*
* @group agg_funcs
* @since 1.6.0
*/
def collect_set(columnName: String): Column = collect_set(Column(columnName))