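Suppose I have a table with 30 billion rows and multiple columns, and I want to efficiently find the top N most frequent values for each column independently, with the most elegant SQL possible. For example, if I have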
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris Freemont Possum Ubik
Nancy Freemont Lemur Housekeeping
Nancy Drew Penguin Ubik
Bill Ribbits Lemur Dhalgren
and I want the top 1, then the result would be:
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy Freemont Lemur Ubik
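I can probably come up with ways to do this, but I don't know whether they are optimal, which matters when there are 30 billion rows; the SQL could be big and ugly, and it might use too much temp space.
Using Oracle.
This should only make one pass over the table. You can use the analytic version of count() to get the frequency of each value independently: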
select firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;
FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill 1 Ribbits 1 Lemur 2 Dhalgren 1
Ferris 1 Freemont 2 Possum 1 Ubik 2
Nancy 2 Freemont 2 Lemur 2 Housekeeping 1
Nancy 2 Drew 1 Penguin 1 Ubik 2
You can then use that as a CTE (or subquery factoring, as I think it's called in Oracle terminology) and pull out only the highest-frequency value from each column:
with tmp_tab as (
select /*+ MATERIALIZE */
firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table)
select (select firstname from (
select firstname,
row_number() over (partition by null order by c_fn desc) as r_fn
from tmp_tab
) where r_fn = 1) as firstname,
(select lastname from (
select lastname,
row_number() over (partition by null order by c_ln desc) as r_ln
from tmp_tab
) where r_ln = 1) as lastname,
(select favoriteanimal from (
select favoriteanimal,
row_number() over (partition by null order by c_fa desc) as r_fa
from tmp_tab
) where r_fa = 1) as favoriteanimal,
(select favoritebook from (
select favoritebook,
row_number() over (partition by null order by c_fb desc) as r_fb
from tmp_tab
) where r_fb = 1) as favoritebook
from dual;
FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy Freemont Lemur Ubik
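You make one pass over the CTE for each column, but that should still only hit the real table once (thanks to the materialize hint). You may also want to add to the order by clauses to control what happens when there are ties.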
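As a minimal sketch of the tie-breaking idea, here is the first-name branch on its own with a secondary sort key added. Choosing the alphabetically first name among equally frequent ones is only an assumption for illustration; any deterministic rule would do.

```sql
-- Sketch: same ranking as above, restricted to one column.
-- Assumed tie-break: when two names have the same count,
-- prefer the one that sorts first alphabetically.
with tmp_tab as (
  select firstname, count(*) over (partition by firstname) as c_fn
  from my_table)
select firstname
from (
  select firstname,
         row_number() over (order by c_fn desc, firstname asc) as r_fn
  from tmp_tab)
where r_fn = 1;
```

Without the second sort key, row_number() is free to pick any of the tied values, so repeated runs are not guaranteed to agree.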
This is conceptually similar to what Thilo, ysth and others have suggested, except that you let Oracle keep track of all the counts.
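The query above returns only the top value per column, and its scalar subqueries cannot return more than one row each. For the general top-N case, one possible variation (not part of the approach above, and untested at this scale) is to aggregate with GROUPING SETS and return the result in a long (column, value, count) layout. The bind variable :n and the output column names are assumptions for illustration, and the coalesce assumes all four columns share a compatible character type.

```sql
-- Sketch only: top-N most frequent values per column, one row per
-- (column, value) pair. :n is a hypothetical bind variable for N.
select col_name, val, cnt
from (
  select col_name, val, cnt,
         row_number() over (partition by col_name
                            order by cnt desc, val) as rn
  from (
    -- Each grouping set groups by exactly one column; the other
    -- three come back as NULL, and grouping() = 0 identifies the
    -- column that was actually grouped on.
    select case
             when grouping(firstname) = 0 then 'FIRSTNAME'
             when grouping(lastname) = 0 then 'LASTNAME'
             when grouping(favoriteanimal) = 0 then 'FAVORITEANIMAL'
             else 'FAVORITEBOOK'
           end as col_name,
           coalesce(firstname, lastname, favoriteanimal, favoritebook) as val,
           count(*) as cnt
    from my_table
    group by grouping sets ((firstname), (lastname),
                            (favoriteanimal), (favoritebook))
  )
)
where rn <= :n
order by col_name, rn;
```

Whether Oracle satisfies this with a single scan of my_table depends on the optimizer, so the explain plan is worth checking before relying on it.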
Edit: Hmm, the explain plan shows it doing four full table scans; may need to think about this a bit more...
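Edit 2: Adding the (undocumented) MATERIALIZE hint to the CTE seems to resolve this; it creates a transient temporary table to hold the results and does only one full table scan. The explain plan cost is higher, though, at least on this tiny sample data set. I'd be interested in any comments on the downsides of doing this.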
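To compare the plans with and without the hint, something like the following can be used. It is a sketch cut down to two columns to keep it short; EXPLAIN PLAN and DBMS_XPLAN.DISPLAY are standard Oracle features, but the exact plan output will vary by version and data.

```sql
-- Generate the plan for a two-column version of the query.
-- Remove the hint and rerun to compare the two plans.
explain plan for
with tmp_tab as (
  select /*+ MATERIALIZE */
         firstname, count(*) over (partition by firstname) as c_fn,
         lastname, count(*) over (partition by lastname) as c_ln
  from my_table)
select (select firstname from (
          select firstname,
                 row_number() over (order by c_fn desc) as r_fn
          from tmp_tab
        ) where r_fn = 1) as firstname,
       (select lastname from (
          select lastname,
                 row_number() over (order by c_ln desc) as r_ln
          from tmp_tab
        ) where r_ln = 1) as lastname
from dual;

-- Show the plan that was just generated.
select * from table(dbms_xplan.display);
```

Counting the TABLE ACCESS FULL steps against MY_TABLE in the output shows how many times the real table is scanned in each variant.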