SQL on Spark: How to get all the DISTINCT values?

Question

Suppose I have the following table:

Name | Color
------------------------------
John | Blue
Greg | Red
John | Yellow
Greg | Red
Greg | Blue


For each name, I want the number of distinct colors together with the distinct values themselves. Something like this:

Name | Distinct | Values
--------------------------------------
John |   2      | Blue, Yellow
Greg |   2      | Red, Blue


Any ideas on how to do this?

Answer 1

Zah*_*Mor 7

collect_list gives you a list without removing duplicates. collect_set removes duplicates automatically, so:

select
  Name,
  count(distinct Color) as Distinct, -- not a very good name
  collect_set(Color) as Values
from TblName
group by Name

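To make the difference between collect_list and collect_set concrete, here is a minimal sketch that runs the same aggregation over the question's rows. It assumes a Spark version that supports inline tables (`FROM VALUES ... AS t(...)`), so no existing table is needed. The alias `t` and the output column names (`distinct_count`, `all_colors`, `distinct_colors`, `distinct_colors_text`) are made up for illustration and are not part of the original answer. The `concat_ws` column is an optional extra for when a comma-separated string, as in the question's desired output, is preferred over an array.

```sql
-- Sample rows from the question, supplied as an inline table (alias t is hypothetical).
SELECT
  Name,
  count(DISTINCT Color)               AS distinct_count,       -- number of distinct colors per name
  collect_list(Color)                 AS all_colors,           -- array, duplicates kept
  collect_set(Color)                  AS distinct_colors,      -- array, duplicates removed
  concat_ws(', ', collect_set(Color)) AS distinct_colors_text  -- same set joined into one string
FROM VALUES
  ('John', 'Blue'),
  ('Greg', 'Red'),
  ('John', 'Yellow'),
  ('Greg', 'Red'),
  ('Greg', 'Blue')
  AS t(Name, Color)
GROUP BY Name;
```

For Greg, all_colors should contain Red twice while distinct_colors contains it once. The order of elements inside the collected arrays is not guaranteed, so do not rely on it matching the input order.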

collect_set has been available in Spark's functions API since Spark 1.6.0:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

  /**
   * Aggregate function: returns a set of objects with duplicate elements eliminated.
   *
   * For now this is an alias for the collect_set Hive UDAF.
   *
   * @group agg_funcs
   * @since 1.6.0
   */
  def collect_set(columnName: String): Column = collect_set(Column(columnName))

