r.z*_*rei 6 sql-server aggregate t-sql group-by
I have a table with the following structure:
+-------+------------------+
| Value | Date |
+-------+------------------+
| 10 | 10/10/2010 10:00 |
| 11 | 10/10/2010 10:15 |
| 15 | 10/10/2010 10:30 |
| 15 | 10/10/2010 10:45 |
| 17 | 10/10/2010 11:00 |
| 18 | 10/10/2010 11:15 |
| 22 | 10/10/2010 11:30 |
| 30 | 10/10/2010 11:45 |
+-------+------------------+
Currently I am using GROUP BY to get min, max, and avg for an hourly report like this:
+-----+-----+-------+------------------+
| min | max | avg | Date |
+-----+-----+-------+------------------+
| 10 | 15 | 12.75 | 10/10/2010 10:00 |
| 17 | 30 | 21.75 | 10/10/2010 11:00 |
+-----+-----+-------+------------------+
How can I calculate the difference between the values of the last row and the first row in each group, to produce something like this:
+-----+-----+-------+------+------------------+
| min | max | avg | diff | Date |
+-----+-----+-------+------+------------------+
| 10 | 15 | 12.75 | 5 | 10/10/2010 10:00 |
| 17 | 30 | 21.75 | 13 | 10/10/2010 11:00 |
+-----+-----+-------+------+------------------+
Thanks.
And*_*y M 13
You haven't shown the query you are using to get the results without diff. I assume it is something like this:
SELECT
min = MIN(Value),
max = MAX(Value),
avg = AVG(Value), -- or, if Value is an int, perhaps like this:
-- AVG(CAST(Value AS decimal(10,2)))
Date = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
FROM atable
GROUP BY
DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
;
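To try the queries in this answer locally, a sample table could be set up along these lines. The table name atable is the one assumed above; the column types (int and datetime) and the ISO 8601 date literals are assumptions, since the question does not show the table definition.

```sql
-- Hypothetical setup: column types are assumed, not taken from the question.
CREATE TABLE atable (
    Value int      NOT NULL,
    Date  datetime NOT NULL
);

-- Multi-row VALUES needs SQL Server 2008 or later;
-- on SQL Server 2005, use one INSERT statement per row instead.
INSERT INTO atable (Value, Date) VALUES
    (10, '2010-10-10T10:00:00'),
    (11, '2010-10-10T10:15:00'),
    (15, '2010-10-10T10:30:00'),
    (15, '2010-10-10T10:45:00'),
    (17, '2010-10-10T11:00:00'),
    (18, '2010-10-10T11:15:00'),
    (22, '2010-10-10T11:30:00'),
    (30, '2010-10-10T11:45:00');
```

Because Value is assumed to be an int here, AVG(Value) would return a truncated integer, which is why the cast to decimal mentioned in the comment above may be needed to get 12.75 and 21.75.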
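Also, you haven't explained what you mean by first and last. In this answer, I assume that first means the earliest row in the group (according to the Date value) and, likewise, last means the latest row in the group.
One way to add diff could be as follows.
First, add two more aggregate columns, minDate and maxDate, to the original query: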
SELECT
min = MIN(Value),
max = MAX(Value),
avg = AVG(Value),
minDate = MIN(Date),
maxDate = MAX(Date),
Date = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
FROM atable
GROUP BY
DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
;
Next, join the aggregated result set back to the original table, once on minDate and once on maxDate, to access the corresponding Values:
SELECT
g.min,
g.max,
g.avg,
diff = last.Value - first.Value,
g.Date
FROM (
SELECT
min = MIN(Value),
max = MAX(Value),
avg = AVG(Value),
minDate = MIN(Date),
maxDate = MAX(Date),
Date = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
FROM atable
GROUP BY
DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
) g
INNER JOIN atable first ON first.Date = g.minDate
INNER JOIN atable last ON last .Date = g.maxDate
;
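Note that the above assumes that Date values (at least those that happen to be first or last in their corresponding hours) are unique; otherwise you will get more than one row for some of the hours in the output.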
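Whether that uniqueness assumption holds for your data can be checked before relying on the join. The following is only a sketch: it reuses atable and the grouping expression from above, while the names GroupDate, firstRows, and lastRows are introduced here for illustration. It lists the hours in which more than one row shares the earliest or the latest Date.

```sql
-- Hours where the first or the last Date is not unique.
-- An empty result means the join-based query returns one row per hour.
SELECT
    g.GroupDate,
    firstRows = SUM(CASE WHEN t.Date = g.minDate THEN 1 ELSE 0 END),
    lastRows  = SUM(CASE WHEN t.Date = g.maxDate THEN 1 ELSE 0 END)
FROM (
    SELECT
        GroupDate = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0),
        minDate   = MIN(Date),
        maxDate   = MAX(Date)
    FROM atable
    GROUP BY
        DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
) g
INNER JOIN atable t ON t.Date IN (g.minDate, g.maxDate)
GROUP BY
    g.GroupDate
HAVING SUM(CASE WHEN t.Date = g.minDate THEN 1 ELSE 0 END) > 1
    OR SUM(CASE WHEN t.Date = g.maxDate THEN 1 ELSE 0 END) > 1
;
```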
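Another approach, if you are on SQL Server 2005 or later, is to use the window aggregate functions MIN() OVER (...) and MAX() OVER (...) to find the Values that correspond to minDate or maxDate, and then aggregate all the results, similarly to what you are probably doing now. Here is what I mean specifically: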
WITH partitioned AS (
SELECT
Value,
Date,
GroupDate = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
FROM atable
)
, firstlast AS (
SELECT
Value,
Date,
GroupDate,
FirstValue = CASE Date WHEN MIN(Date) OVER (PARTITION BY GroupDate) THEN Value END,
LastValue = CASE Date WHEN MAX(Date) OVER (PARTITION BY GroupDate) THEN Value END
FROM partitioned
)
SELECT
min = MIN(Value),
max = MAX(Value),
avg = AVG(Value), -- or, again, if Value is an int, cast it as a decimal or float
diff = MAX(LastValue) - MIN(FirstValue),
Date = GroupDate
FROM firstlast
GROUP BY
GroupDate
;
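As you can see, the first common table expression (CTE) merely returns all the rows and adds a computed column, GroupDate, which is subsequently used for grouping and partitioning. So it essentially just assigns a name to the grouping expression. That is done to make the whole query more readable and maintainable, because the column is later referenced more than once. This is what the first CTE produces: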
+-------+------------------+------------------+
| Value | Date | GroupDate |
+-------+------------------+------------------+
| 10 | 10/10/2010 10:00 | 10/10/2010 10:00 |
| 11 | 10/10/2010 10:15 | 10/10/2010 10:00 |
| 15 | 10/10/2010 10:30 | 10/10/2010 10:00 |
| 15 | 10/10/2010 10:45 | 10/10/2010 10:00 |
| 17 | 10/10/2010 11:00 | 10/10/2010 11:00 |
| 18 | 10/10/2010 11:15 | 10/10/2010 11:00 |
| 22 | 10/10/2010 11:30 | 10/10/2010 11:00 |
| 30 | 10/10/2010 11:45 | 10/10/2010 11:00 |
+-------+------------------+------------------+
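The second CTE adds two more columns to the above result. It uses the window aggregate functions MIN() OVER ... and MAX() OVER ... to match against Date, and where a match occurs, the corresponding Value is returned in a separate column, FirstValue or LastValue: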
+-------+------------------+------------------+------------+-----------+
| Value | Date | GroupDate | FirstValue | LastValue |
+-------+------------------+------------------+------------+-----------+
| 10 | 10/10/2010 10:00 | 10/10/2010 10:00 | 10 | NULL |
| 11 | 10/10/2010 10:15 | 10/10/2010 10:00 | NULL | NULL |
| 15 | 10/10/2010 10:30 | 10/10/2010 10:00 | NULL | NULL |
| 15 | 10/10/2010 10:45 | 10/10/2010 10:00 | NULL | 15 |
| 17 | 10/10/2010 11:00 | 10/10/2010 11:00 | 17 | NULL |
| 18 | 10/10/2010 11:15 | 10/10/2010 11:00 | NULL | NULL |
| 22 | 10/10/2010 11:30 | 10/10/2010 11:00 | NULL | NULL |
| 30 | 10/10/2010 11:45 | 10/10/2010 11:00 | NULL | 30 |
+-------+------------------+------------------+------------+-----------+
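At this point, everything is ready for the final aggregation. The min, max, and avg columns are aggregated the same way as before, and diff can now easily be obtained as the aggregated LastValue minus the aggregated FirstValue. As the above result set shows, you could use a variety of functions to obtain FirstValue and LastValue for a group: MIN, MAX, SUM, or AVG. Any of them would do, because each group has only one non-NULL value in each of those columns and aggregate functions ignore NULLs.
In the main SELECT, however, I specifically chose to apply MAX() to LastValue and MIN() to FirstValue, as you can see. That was deliberate. Unlike the first suggestion, the second one does not really need Date to be unique. But if either minDate or maxDate happens to have more than one associated Value, FirstValue or LastValue will contain more than one value per group, like this: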
+-------+------------------+------------------+------------+-----------+
| Value | Date | GroupDate | FirstValue | LastValue |
+-------+------------------+------------------+------------+-----------+
| 9 | 10/10/2010 10:00 | 10/10/2010 10:00 | 9 | NULL |
| 10 | 10/10/2010 10:00 | 10/10/2010 10:00 | 10 | NULL |
| 11 | 10/10/2010 10:15 | 10/10/2010 10:00 | NULL | NULL |
| 15 | 10/10/2010 10:30 | 10/10/2010 10:00 | NULL | NULL |
| 15 | 10/10/2010 10:45 | 10/10/2010 10:00 | NULL | 15 |
| 17 | 10/10/2010 11:00 | 10/10/2010 11:00 | 17 | NULL |
| 18 | 10/10/2010 11:15 | 10/10/2010 11:00 | NULL | NULL |
| 22 | 10/10/2010 11:30 | 10/10/2010 11:00 | NULL | NULL |
| 30 | 10/10/2010 11:45 | 10/10/2010 11:00 | NULL | 30 |
| 33 | 10/10/2010 11:45 | 10/10/2010 11:00 | NULL | 33 |
+-------+------------------+------------------+------------+-----------+
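I think that in such cases it would be more natural to take the difference between the largest last value and the smallest first value. However, you should know better what rule to apply here, so you can simply change the query accordingly.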
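As an illustration of changing the rule, here is a sketch that computes the difference in three ways side by side. The column names diffWidest, diffNarrowest, and diffAverage are hypothetical, and the second and third rules are only examples of what could be chosen, not something the question asks for.

```sql
WITH partitioned AS (
    SELECT
        Value,
        Date,
        GroupDate = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
    FROM atable
)
, firstlast AS (
    SELECT
        GroupDate,
        FirstValue = CASE Date WHEN MIN(Date) OVER (PARTITION BY GroupDate) THEN Value END,
        LastValue  = CASE Date WHEN MAX(Date) OVER (PARTITION BY GroupDate) THEN Value END
    FROM partitioned
)
SELECT
    Date          = GroupDate,
    -- largest last value minus smallest first value (the rule used above)
    diffWidest    = MAX(LastValue) - MIN(FirstValue),
    -- smallest last value minus largest first value
    diffNarrowest = MIN(LastValue) - MAX(FirstValue),
    -- average of the tied last values minus average of the tied first values;
    -- the cast avoids integer truncation if Value is an int
    diffAverage   = AVG(CAST(LastValue AS decimal(10,2)))
                  - AVG(CAST(FirstValue AS decimal(10,2)))
FROM firstlast
GROUP BY
    GroupDate
;
```

When Date is unique within each hour, all three columns return the same difference, so the choice only matters for ties.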
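You can test both solutions at SQL Fiddle.
Update
Starting with SQL Server 2012, you can also use the FIRST_VALUE and LAST_VALUE functions and substitute them for the CASE expressions in the firstlast CTE of my last query above, like this: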
FirstValue = FIRST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
LastValue = LAST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
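In this case, it does not matter whether you later use MIN or MAX over FirstValue and LastValue in the main SELECT. Each column will hold the same value (the first or the last Value, respectively) across all rows of the same GroupDate group, so MIN() and MAX() will return the same result in each case.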
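The explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame in the snippet above is what makes every row see the whole partition. When ORDER BY is present and no frame is given, SQL Server defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so LAST_VALUE stops at the current row (and its peers). The sketch below puts the two variants side by side; the column names LastDefaultFrame and LastFullFrame are made up for the comparison.

```sql
WITH partitioned AS (
    SELECT
        Value,
        Date,
        GroupDate = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
    FROM atable
)
SELECT
    Value,
    Date,
    GroupDate,
    -- Default frame: ends at the current row, so with unique Date values
    -- this simply returns the current row's Value.
    LastDefaultFrame = LAST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC),
    -- Full frame: returns the Value of the latest row in the hour on every row.
    LastFullFrame    = LAST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
                           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM partitioned
ORDER BY
    Date
;
```

FIRST_VALUE is less sensitive to this, because the default frame always starts at the first row of the partition, but spelling out the same frame for both functions keeps the two expressions symmetrical.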
In fact, you can get diff directly in the firstlast CTE and then, in the main query, either aggregate it with MIN/MAX or add it to GROUP BY and reference it without aggregation, like this:
WITH partitioned AS (
SELECT
Value,
Date,
GroupDate = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
FROM atable
)
, firstlast AS (
SELECT
Value,
Date,
GroupDate,
diff = LAST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
- FIRST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM partitioned
)
SELECT
min = MIN(Value),
max = MAX(Value),
avg = AVG(Value),
diff,
Date = GroupDate
FROM firstlast
GROUP BY
GroupDate,
diff
;
Going one step further, you can obtain min, max, and avg in firstlast rather than in the main query, using the corresponding window functions:
min = MIN(Value) OVER (PARTITION BY GroupDate),
max = MAX(Value) OVER (PARTITION BY GroupDate),
avg = AVG(Value) OVER (PARTITION BY GroupDate),
With these three extra columns and the previous change, the firstlast CTE would return a row set like this for your example:
+-------+------------------+------------------+-----+-----+-------+------+
| Value | Date | GroupDate | min | max | avg | diff |
+-------+------------------+------------------+-----+-----+-------+------+
| 10 | 10/10/2010 10:00 | 10/10/2010 10:00 | 10 | 15 | 12.75 | 5 |
| 11 | 10/10/2010 10:15 | 10/10/2010 10:00 | 10 | 15 | 12.75 | 5 |
| 15 | 10/10/2010 10:30 | 10/10/2010 10:00 | 10 | 15 | 12.75 | 5 |
| 15 | 10/10/2010 10:45 | 10/10/2010 10:00 | 10 | 15 | 12.75 | 5 |
| 17 | 10/10/2010 11:00 | 10/10/2010 11:00 | 17 | 30 | 21.75 | 13 |
| 18 | 10/10/2010 11:15 | 10/10/2010 11:00 | 17 | 30 | 21.75 | 13 |
| 22 | 10/10/2010 11:30 | 10/10/2010 11:00 | 17 | 30 | 21.75 | 13 |
| 30 | 10/10/2010 11:45 | 10/10/2010 11:00 | 17 | 30 | 21.75 | 13 |
+-------+------------------+------------------+-----+-----+-------+------+
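Note how GroupDate, min, max, avg, and diff (the columns you actually need for the final set) are simply repeated across all rows that belong to the same group. That means you can drop Value and Date, rename GroupDate to Date, rearrange the columns slightly, and apply DISTINCT to the result set, and you have eliminated the last SELECT: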
WITH partitioned AS (
SELECT
Value,
Date,
GroupDate = DATEADD(HOUR, DATEDIFF(HOUR, 0, Date), 0)
FROM
atable
)
SELECT DISTINCT
min = MIN(Value) OVER (PARTITION BY GroupDate),
max = MAX(Value) OVER (PARTITION BY GroupDate),
avg = AVG(Value) OVER (PARTITION BY GroupDate),
diff = LAST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
- FIRST_VALUE(Value) OVER (PARTITION BY GroupDate ORDER BY Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
Date = GroupDate
FROM
partitioned
;
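Finally, you can also move the GroupDate calculation into the same scope where min, max, and the rest are calculated. You can use CROSS APPLY for that, which avoids the need for a nested query altogether. In other words, this way you can get rid of the partitioned CTE as well. The entire query would then look like this: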
SELECT DISTINCT
min = MIN(t.Value) OVER (PARTITION BY x.GroupDate),
max = MAX(t.Value) OVER (PARTITION BY x.GroupDate),
avg = AVG(t.Value) OVER (PARTITION BY x.GroupDate),
diff = LAST_VALUE(t.Value) OVER (PARTITION BY x.GroupDate ORDER BY t.Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
- FIRST_VALUE(t.Value) OVER (PARTITION BY x.GroupDate ORDER BY t.Date ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
Date = x.GroupDate
FROM
atable AS t
CROSS APPLY (SELECT DATEADD(HOUR, DATEDIFF(HOUR, 0, t.Date), 0)) AS x (GroupDate)
;
It returns the same results. You can test it at SQL Fiddle as well.