SQL on Spark: How do I get all the DISTINCT values?

sha*_*dzy 4 sql apache-spark-sql

So, suppose I have the following table:

Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue

For each name, I want a table of its distinct colors: how many there are and what they are. Something like this:

Name | Distinct | Values
--------------------------------------
John |   2      | Blue, Yellow
Greg |   2      | Red, Blue

Any ideas on how to do this?

Zah*_*Mor 7

collect_list gives you a list without removing duplicates. collect_set removes duplicates automatically, so:

select
  Name,
  count(distinct Color) as `Distinct`, -- not a very good name
  collect_set(Color) as `Values`
from TblName
group by Name
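If you would rather use the DataFrame API than a SQL string, the same aggregation might look roughly like the sketch below. It assumes Spark 2.x or later (for SparkSession) running in local mode; the object name DistinctColors and the column aliases DistinctCount, Colors and AllColors are made up for illustration and are not from the question. collect_list is included only to show the contrast with collect_set.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, collect_set, countDistinct}

// Hypothetical driver object; the name is illustrative only.
object DistinctColors {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for trying this out.
    val spark = SparkSession.builder()
      .appName("distinct-colors")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The sample rows from the question.
    val df = Seq(
      ("John", "Blue"),
      ("Greg", "Red"),
      ("John", "Yellow"),
      ("Greg", "Red"),
      ("Greg", "Blue")
    ).toDF("Name", "Color")

    val result = df.groupBy("Name").agg(
      countDistinct("Color").as("DistinctCount"), // number of distinct colors per name
      collect_set("Color").as("Colors"),          // duplicates removed
      collect_list("Color").as("AllColors")       // duplicates kept, for comparison
    )

    result.show(truncate = false)
    spark.stop()
  }
}
```

Note that collect_set does not guarantee the order of the elements in the resulting array, so wrap it in sort_array if you need a stable order.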

This function has been available since Spark 1.6.0:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

/**
   * Aggregate function: returns a set of objects with duplicate elements eliminated.
   *
   * For now this is an alias for the collect_set Hive UDAF.
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_set(columnName: String): Column = collect_set(Column(columnName))