Find the most frequently occurring value in each group

dsr*_*301 2 clickhouse

I want to find the value that occurs most often within each group.

I tried using top(k)(column), but got the following error: the column is not under an aggregate function and not in GROUP BY.

For example, suppose I have a table test_date with columns (pid, value):

pid, value
----------
1,a
1,b
1,a
1,c
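To reproduce the example, the sample rows could be loaded as follows. This is only a sketch: the column types (`UInt32`, `String`) and the `Memory` engine are assumptions, since the question does not state them.

```sql
-- Assumed schema: the question gives only the column names, not their types.
CREATE TABLE test_date
(
    pid UInt32,
    value String
)
ENGINE = Memory;

-- The four sample rows from the question.
INSERT INTO test_date VALUES (1, 'a'), (1, 'b'), (1, 'a'), (1, 'c');
```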

I want the result:

pid, value
----------
1,a

I tried SELECT pid, top(1)(value) top_value FROM test_data group by pid

I get the error: 

Column value  is not under aggregate function and not in GROUP BY

I also tried anyHeavy(), but it only works for values that occur in more than half of the cases.

vla*_*mir 6

This query should help you:

    SELECT
        pid,
        /*
        Decompose the query in parts:
        1. groupArray((value, count)): convert the group of rows with the same 'pid' to the array of tuples (value, count)
        2. arrayReverseSort: make reverse sorting by 'count' ('x.2' is 'count')
        3. [1].1: take the 'value' from the first item of the sorted array
        */
        arrayReverseSort(x -> x.2, groupArray((value, count)))[1].1 AS value
    FROM
    (
        SELECT
            pid,
            value,
            count() AS count
        FROM test_date
        GROUP BY
            pid,
            value
    )
    GROUP BY pid
    ORDER BY pid ASC
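As a possible variation on the query above, the outer step (sort the array, take the first item) might also be expressed with `argMax`, which returns the `value` from the row with the largest count. This is a sketch under the same assumptions about the `test_date` table, not part of the original answer; the alias `cnt` is introduced here for illustration, and when several values are tied for the highest count, which one is returned is not defined.

```sql
SELECT
    pid,
    -- argMax(arg, val): the 'value' taken from the row where 'cnt' is largest
    argMax(value, cnt) AS value
FROM
(
    -- Exact count of every (pid, value) pair
    SELECT
        pid,
        value,
        count() AS cnt
    FROM test_date
    GROUP BY
        pid,
        value
)
GROUP BY pid
ORDER BY pid ASC
```

For the sample rows in the question (a twice, b and c once each, all with pid 1), this yields the single row `1, a`.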

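  • I deliberately ignored *topK* for this [reason](https://clickhouse.yandex/docs/en/query_language/agg_functions/reference/#topk-n-column): "This function doesn't provide a guaranteed result. In certain situations, errors might occur and it might return frequent values that aren't the most frequent values." (2 upvotes)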