Using PARTITION BY with an ORDER BY clause in PostgreSQL

tes*_*big 3 sql postgresql sql-order-by

I have a table with these values:

user_id ts                  val
uid1    19.05.2019 01:49:50  0
uid1    19.05.2019 01:50:15  0
uid1    19.05.2019 01:50:20  0
uid1    19.05.2019 01:59:50  1
uid1    19.05.2019 02:20:10  1
uid1    19.05.2019 02:20:15  0
uid1    19.05.2019 02:20:19  0
uid1    19.05.2019 02:30:53  1
uid1    19.05.2019 11:10:25  1
uid1    19.05.2019 11:13:40  0
uid1    19.05.2019 11:13:50  0
uid1    19.05.2019 11:20:19  1
uid2    19.05.2019 15:01:44  0
uid2    19.05.2019 15:05:55  0
uid2    19.05.2019 17:19:35  1
uid2    19.05.2019 17:20:01  0
uid2    19.05.2019 17:20:35  0
uid2    19.05.2019 19:15:50  1

When I query this table with only a partition by clause, the result looks like this:

Query: select *, sum(val) over (partition by user_id) as res from example_table;

user_id ts                  val res
uid1    19.05.2019 01:49:50  0  5
uid1    19.05.2019 01:50:15  0  5
uid1    19.05.2019 01:50:20  0  5
uid1    19.05.2019 01:59:50  1  5
uid1    19.05.2019 02:20:10  1  5
uid1    19.05.2019 02:20:15  0  5
uid1    19.05.2019 02:20:19  0  5
uid1    19.05.2019 02:30:53  1  5
uid1    19.05.2019 11:10:25  1  5
uid1    19.05.2019 11:13:40  0  5
uid1    19.05.2019 11:13:50  0  5
uid1    19.05.2019 11:20:19  1  5
uid2    19.05.2019 15:01:44  0  2
uid2    19.05.2019 15:05:55  0  2
uid2    19.05.2019 17:19:35  1  2
uid2    19.05.2019 17:20:01  0  2
uid2    19.05.2019 17:20:35  0  2
uid2    19.05.2019 19:15:50  1  2

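In the result above, the res column holds the sum of the val column for each partition. But if I query the table with both partition by and order by, I get these results:

Query: select *, sum(val) over (partition by user_id order by ts) as res from example_table;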

user_id ts                  val res
uid1    19.05.2019 01:49:50  0  0
uid1    19.05.2019 01:50:15  0  0
uid1    19.05.2019 01:50:20  0  0
uid1    19.05.2019 01:59:50  1  1
uid1    19.05.2019 02:20:10  1  2
uid1    19.05.2019 02:20:15  0  2
uid1    19.05.2019 02:20:19  0  2
uid1    19.05.2019 02:30:53  1  3
uid1    19.05.2019 11:10:25  1  4
uid1    19.05.2019 11:13:40  0  4
uid1    19.05.2019 11:13:50  0  4
uid1    19.05.2019 11:20:19  1  5
uid2    19.05.2019 15:01:44  0  0
uid2    19.05.2019 15:05:55  0  0
uid2    19.05.2019 17:19:35  1  1
uid2    19.05.2019 17:20:01  0  1
uid2    19.05.2019 17:20:35  0  1
uid2    19.05.2019 19:15:50  1  2

With the order by clause, however, the res column holds the cumulative sum of the val column for each row within its partition.

Why? I can't understand this.

Pau*_*gel 5

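UPDATE

This behavior is documented here:

4.2.8. Window Function Calls

[..] The default framing option is RANGE UNBOUNDED PRECEDING, which is the same as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. With ORDER BY, this sets the frame to be all rows from the partition start up through the current row's last ORDER BY peer. Without ORDER BY, this means all rows of the partition are included in the window frame, since all rows become peers of the current row.

This means: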

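In the absence of a frame_clause, RANGE UNBOUNDED PRECEDING is used by default. That frame includes:

  • all rows "preceding" the current row according to the ORDER BY clause
  • the current row
  • all rows that have the same values in the ORDER BY columns as the current row (its peers)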
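
The third bullet only matters when the ORDER BY column contains ties. Here is a minimal sketch of that case; the inline table t and its integer ts values are hypothetical sample data, not the table from the question. It compares the default frame with explicit RANGE and ROWS frames:

```sql
-- Hypothetical sample data: the two middle rows share the same ts, so they are peers.
with t(user_id, ts, val) as (
  values
    ('uid1', 1, 1),
    ('uid1', 2, 1),
    ('uid1', 2, 1),
    ('uid1', 3, 1)
)
select user_id, ts, val
-- no frame_clause: the default frame applies
, sum(val) over (partition by user_id order by ts) as res_default
-- the default frame written out explicitly
, sum(val) over (
    partition by user_id
    order by ts
    range between unbounded preceding and current row
  ) as res_range
-- ROWS stops at the current row and ignores peers
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and current row
  ) as res_rows
from t
order by ts;
```

res_default and res_range should be 1, 3, 3, 4, because each tied row's frame extends through its last peer. res_rows should be 1, 2, 3, 4, with the two tied rows receiving 2 and 3 in an unspecified order.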

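In the absence of an ORDER BY clause, ORDER BY NULL is assumed (though I am guessing again). The frame therefore includes all rows of the partition, because the value in the ORDER BY column is the same in every row (it is always NULL).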
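
One way to check this is to write the default frame out explicitly and leave out ORDER BY. This is a small sketch; the inline table t is hypothetical sample data:

```sql
-- Hypothetical sample data; no ORDER BY is used in either window.
with t(user_id, val) as (
  values ('uid1', 0), ('uid1', 1), ('uid1', 1), ('uid2', 1)
)
select user_id, val
-- no ORDER BY, no frame_clause
, sum(val) over (partition by user_id) as res_default
-- no ORDER BY, default frame spelled out: every row is a peer of the current row
, sum(val) over (
    partition by user_id
    range between unbounded preceding and current row
  ) as res_range
from t;
```

Both columns should show the full partition sum on every row (2 for uid1, 1 for uid2), which matches the documentation's statement that without ORDER BY all rows become peers of the current row.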

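Original answer:

Disclaimer: The following is more of a guess than a qualified answer. I didn't find any documentation that confirms what I'm writing. At the same time, I don't think the answers given so far explain the behavior correctly.

The different results are not caused directly by the ORDER BY clause itself, since a + b + c is the same as c + b + a. The reason is (and this is my guess) that the ORDER BY clause implicitly defines the frame_clause as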

rows between unbounded preceding and current row

Try the following query:

select *
, sum(val) over (partition by user_id) as res
, sum(val) over (partition by user_id order by ts) as res_order_by
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and current row
  ) as res_order_by_unbounded_preceding
, sum(val) over (
    partition by user_id
    -- order by ts
    rows between unbounded preceding and current row
  ) as res_preceding
, sum(val) over (
    partition by user_id
    -- order by ts
    rows between current row and unbounded following
  ) as res_following
, sum(val) over (
    partition by user_id
    order by ts
    rows between unbounded preceding and unbounded following
  ) as res_orderby_preceding_following

from example_table;

db<>fiddle

You will see that you can get a cumulative sum without an ORDER BY clause, and that you can get the "full" sum with an ORDER BY clause.