dak*_*kes 3 sql-server sql-server-2014
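I'm querying a table of cinema ticket sales. The table contains 380k rows. Each row represents one screening of a movie (which cinema, when, how many tickets, at what price, and so on).
For every row I need to compute several running totals: Admissions Paid, Admissions Revenue, Admissions Free and Total Admissions.
For a given row, Admissions Paid is the sum of quantity over all rows for the same movie with a start_date_time up to and including that row's, restricted to rows where price > 0. The other three columns are computed in a similar way.
I wrote the following query and created an index for it: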
SELECT [ID]
,[cinema_name]
,[movie_title]
,[price]
,[quantity]
,[start_date_time]
,* --I need all the columns for reporting
,(select SUM(quantity)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time
and price=0) as [Admissions Free]
,(select SUM(quantity)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time
and price>0) as [Admissions Paid]
,(select SUM(quantity*price)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time
and price>0) as [Admissions Revenue]
,(select SUM(quantity)
from [movies] i
where i.movie_title=o.movie_title
and i.start_date_time<=o.start_date_time) as [Total Admissions]
FROM [movies] o
I created the following index, which brought the query time down to 5 minutes:
CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
[movie_title] ASC,
[start_date_time] ASC,
[price] DESC
)
INCLUDE ( [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
However, this index brings the query time down to 1:30:
CREATE NONCLUSTERED INDEX [startdatetime_movietitle_price] ON [dbo].[movies]
(
[start_date_time] ASC,
[movie_title] ASC,
[price] DESC
)
INCLUDE ( [quantity]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
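So my question is: why? As I understand it, it would make more sense to narrow down by movie title first and then look at the start time, because there are far more distinct start times than movies. Distinct movies: 51; distinct start_date_time values: 8786.
If start_date_time comes first, wouldn't the underlying B-tree eliminate unnecessary branches earlier and so cut off more of them?
Here are the execution plans:
The first image shows the execution plan for the index with movie_title first; the second shows the plan for the index with start_date_time first.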
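The first index looks better suited to the query. Please provide the actual execution plans.
I would try window functions instead of the four correlated subqueries, or a single correlated subquery (with OUTER APPLY), and see which of the two indexes gets used.
The point of both ideas is to push the optimizer towards a single index scan for collecting the running sums, instead of the four scans that both of your plans perform.
It is also worth checking and comparing the execution plans when you request all columns versus only the columns that are in the index.
Using window functions: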
-- window functions
SELECT
-- m.*,
movie_title, start_date_time,
price, quantity,
SUM(CASE WHEN price = 0 THEN quantity ELSE 0 END)
OVER
(PARTITION BY movie_title
ORDER BY start_date_time
RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS [Admissions Free],
SUM(CASE WHEN price > 0 THEN quantity ELSE 0 END)
OVER
(PARTITION BY movie_title
ORDER BY start_date_time
RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS [Admissions Paid],
SUM(CASE WHEN price > 0 THEN quantity * price ELSE 0 END)
OVER
(PARTITION BY movie_title
ORDER BY start_date_time
RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS [Admissions Revenue],
SUM(quantity)
OVER
(PARTITION BY movie_title
ORDER BY start_date_time
RANGE BETWEEN UNBOUNDED PRECEDING
AND CURRENT ROW
) AS [Total Admissions]
FROM
[movies] AS m ;
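Note: if there were a UNIQUE constraint on (movie_title, start_date_time), you could use ROWS instead of RANGE for the window frame, which is usually more efficient. Judging from the comments, there is no such constraint and many rows may share the same title and datetime, so RANGE is required in the query above.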
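To illustrate why ties matter for the frame choice, here is a minimal sketch on a hypothetical three-row table variable (the names @t, total_range and total_rows and the sample values are made up for illustration, not taken from the question's data). Two rows share the same title and start time:

```sql
-- Hypothetical sample data: two rows share the same (movie_title, start_date_time).
DECLARE @t TABLE (
    movie_title     varchar(50),
    start_date_time datetime,
    quantity        int
);

INSERT INTO @t (movie_title, start_date_time, quantity)
VALUES ('A', '2015-01-01T10:00:00', 10),
       ('A', '2015-01-01T10:00:00', 20),  -- tie with the previous row
       ('A', '2015-01-01T12:00:00', 5);

SELECT movie_title, start_date_time, quantity,
       -- RANGE treats rows with equal ORDER BY values as peers:
       -- every peer sees the sum over all of its peers.
       SUM(quantity) OVER (PARTITION BY movie_title
                           ORDER BY start_date_time
                           RANGE BETWEEN UNBOUNDED PRECEDING
                                     AND CURRENT ROW) AS total_range,
       -- ROWS stops at the physical current row, so tied rows
       -- get different sums, in an arbitrary order.
       SUM(quantity) OVER (PARTITION BY movie_title
                           ORDER BY start_date_time
                           ROWS BETWEEN UNBOUNDED PRECEDING
                                    AND CURRENT ROW) AS total_rows
FROM @t;
```

With RANGE, both 10:00 rows report 30, which matches the `<=` comparison in the original subqueries. With ROWS, one of the two tied rows reports only its own quantity, and which one does is not deterministic. Only when the ordering columns are unique within a partition do the two frames give the same result.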
Using OUTER APPLY:
-- using OUTER APPLY
SELECT
-- m.*,
m.movie_title, m.start_date_time,
m.price, m.quantity,
c.[Admissions Free],
c.[Admissions Paid],
c.[Admissions Revenue],
c.[Total Admissions]
FROM
[movies] AS m
OUTER APPLY
( SELECT
SUM(CASE WHEN i.price = 0 THEN i.quantity ELSE 0 END)
AS [Admissions Free],
SUM(CASE WHEN i.price > 0 THEN i.quantity ELSE 0 END)
AS [Admissions Paid],
SUM(CASE WHEN i.price > 0 THEN i.quantity * i.price ELSE 0 END)
AS [Admissions Revenue],
SUM(i.quantity)
AS [Total Admissions]
FROM [movies] AS i
WHERE i.movie_title = m.movie_title
AND i.start_date_time <= m.start_date_time
) AS c ;
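This index might be slightly better than your first one, since price is only needed for the CASE expressions and can be an included column rather than a key column: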
(
movie_title ASC,
start_date_time ASC
)
INCLUDE (price, quantity)
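As a rough sketch of how that index could be created and tried out, the batch below spells out a full CREATE INDEX statement and then measures a narrow version of the window-function query. The index name IX_movies_title_start is hypothetical, the dbo schema is assumed from the index definitions in the question, and the table hint is there only to force the index during the experiment, not as a recommendation for production code:

```sql
-- Hypothetical index name; key and included columns as suggested above.
CREATE NONCLUSTERED INDEX [IX_movies_title_start]
ON [dbo].[movies] (movie_title ASC, start_date_time ASC)
INCLUDE (price, quantity);
GO

-- Report logical reads and CPU/elapsed time alongside the results.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

-- Narrow query: only columns covered by the index,
-- forced onto the new index with a table hint for comparison purposes.
SELECT m.movie_title, m.start_date_time, m.price, m.quantity,
       SUM(m.quantity) OVER (PARTITION BY m.movie_title
                             ORDER BY m.start_date_time
                             RANGE BETWEEN UNBOUNDED PRECEDING
                                       AND CURRENT ROW) AS [Total Admissions]
FROM [dbo].[movies] AS m WITH (INDEX ([IX_movies_title_start]));
GO

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```

Running the same query again with the hint pointing at one of the existing indexes, and once more with m.* added to the select list, gives comparable numbers for logical reads and elapsed time, along with the actual execution plans requested above.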