Efficiently finding the top-N values from multiple columns independently in Oracle

Jod*_*oda 4 sql oracle top-n

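Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most frequent values for each column independently, with the most elegant SQL possible. For example, if I have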

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris    Freemont Possum         Ubik
Nancy     Freemont Lemur          Housekeeping
Nancy     Drew     Penguin        Ubik
Bill      Ribbits  Lemur          Dhalgren

and I want the top 1, then the result would be:

FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy     Freemont Lemur          Ubik

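I can probably come up with ways to do this, but I don't know whether they would be optimal, which matters when there are 30 billion rows; the SQL might be big and ugly, and might use too much temp space.

Using Oracle.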

Ale*_*ole 5

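This should only make one pass over the table. You can use the analytic version of count() to get the frequency of each value independently: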

select firstname, count(*) over (partition by firstname) as c_fn,
    lastname, count(*) over (partition by lastname) as c_ln,
    favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
    favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;

FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill      1 Ribbits     1 Lemur      2 Dhalgren        1
Ferris    1 Freemont    2 Possum     1 Ubik            2
Nancy     2 Freemont    2 Lemur      2 Housekeeping    1
Nancy     2 Drew        1 Penguin    1 Ubik            2
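
To reproduce the output above, the question's sample data could be loaded with something like the following. This is only a sketch: the table name my_table is taken from the query, but the varchar2(30) column types are an assumption, since the question does not give a table definition.

```sql
-- Hypothetical definition; the column types are assumed, not taken from the question
create table my_table (
    firstname      varchar2(30),
    lastname       varchar2(30),
    favoriteanimal varchar2(30),
    favoritebook   varchar2(30)
);

-- The four sample rows from the question
insert into my_table values ('Ferris', 'Freemont', 'Possum', 'Ubik');
insert into my_table values ('Nancy', 'Freemont', 'Lemur', 'Housekeeping');
insert into my_table values ('Nancy', 'Drew', 'Penguin', 'Ubik');
insert into my_table values ('Bill', 'Ribbits', 'Lemur', 'Dhalgren');
commit;
```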

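You can then use that as a CTE (or subquery factoring, I think, in Oracle terminology) and pull out only the highest-frequency value from each column: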

with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln,
        favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
        favoritebook, count(*) over (partition by favoritebook) as c_fb
    from my_table)
select (select firstname from (
        select firstname,
            row_number() over (partition by null order by c_fn desc) as r_fn
        from tmp_tab
        ) where r_fn = 1) as firstname,
    (select lastname from (
        select lastname,
            row_number() over (partition by null order by c_ln desc) as r_ln
        from tmp_tab
        ) where r_ln = 1) as lastname,
    (select favoriteanimal from (
        select favoriteanimal,
            row_number() over (partition by null order by c_fa desc) as r_fa
        from tmp_tab
        ) where r_fa = 1) as favoriteanimal,
    (select favoritebook from (
        select favoritebook,
            row_number() over (partition by null order by c_fb desc) as r_fb
        from tmp_tab
        ) where r_fb = 1) as favoritebook
from dual;

FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy  Freemont Lemur   Ubik

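You make one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize hint). And you may want to add to the order by clauses to control what happens if there are ties.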
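
As a sketch of the tie-handling point, one option is to add the column itself as a second sort key, so that equal counts are resolved alphabetically rather than arbitrarily. Sorting alphabetically is just one possible rule, chosen here for illustration; only the firstname column is shown to keep the example short.

```sql
with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn
    from my_table)
select firstname from (
    select firstname,
        -- equal counts are broken alphabetically by firstname (an assumed rule)
        row_number() over (order by c_fn desc, firstname) as r_fn
    from tmp_tab
    ) where r_fn = 1;
```

Without the second sort key, which of several equally frequent values gets row number 1 is not guaranteed and may differ between runs.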

This is conceptually similar to what Thilo, ysth and others have suggested, except that you're letting Oracle keep track of all the counting.
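
The scalar subqueries above can each return only one row, so they cover the top-1 case. For a general top N, one possible extension (a sketch, not something tested at the 30-billion-row scale) is to reduce each column to its distinct values, number them by frequency, and join the per-column rankings on that number. The names top_fn, top_ln and rn are hypothetical, N is fixed at 3 for illustration, and only two columns are shown.

```sql
with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln
    from my_table),
top_fn as (
    -- one row per distinct firstname, numbered from most to least frequent
    select firstname,
        row_number() over (order by c_fn desc, firstname) as rn
    from (select distinct firstname, c_fn from tmp_tab)),
top_ln as (
    -- same idea for lastname
    select lastname,
        row_number() over (order by c_ln desc, lastname) as rn
    from (select distinct lastname, c_ln from tmp_tab))
select top_fn.rn, top_fn.firstname, top_ln.lastname
from top_fn
join top_ln on top_ln.rn = top_fn.rn
where top_fn.rn <= 3   -- N = 3, chosen for illustration
order by top_fn.rn;
```

Because this uses an inner join, a column with fewer than N distinct values cuts the result short for every column; a full outer join on rn would avoid that at the cost of a slightly longer query.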

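Edit: Hmm, the explain plan shows it doing four full table scans; may need to think about this a bit more...

Edit 2: Adding the (undocumented) MATERIALIZE hint to the CTE seems to resolve this; it creates a transient temporary table to hold the results, and only does one full table scan. The explain plan cost is higher though, at least on this sample data set. I'd be interested in any comments on any downside to doing this.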
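
To check the number of full table scans yourself, the plan can be generated with EXPLAIN PLAN and displayed with DBMS_XPLAN. The query below is a deliberately simplified stand-in for the full one (two columns, and max() in place of the row_number() subqueries), used only to show the mechanics; substitute the real query in practice.

```sql
explain plan for
with tmp_tab as (
    select /*+ MATERIALIZE */
        firstname, count(*) over (partition by firstname) as c_fn,
        lastname, count(*) over (partition by lastname) as c_ln
    from my_table)
select (select max(c_fn) from tmp_tab) as max_fn,
    (select max(c_ln) from tmp_tab) as max_ln
from dual;

-- Show the plan that was just stored in the plan table
select * from table(dbms_xplan.display);
```

With the hint in place you would expect a single full scan of MY_TABLE feeding a temporary table, typically under a TEMP TABLE TRANSFORMATION step; removing the hint and re-running lets you compare the two plans. The exact plan depends on the Oracle version and optimizer statistics.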