Aggregate for each day over a time series, without using non-equijoin logic

iam*_*ave 17 sql t-sql sql-server join date

Initial question

Given the following dataset, paired with a dates table:

MembershipId | ValidFromDate | ValidToDate
==========================================
0001         | 1997-01-01    | 2006-05-09
0002         | 1997-01-01    | 2017-05-12
0003         | 2005-06-02    | 2009-02-07

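How many Memberships were open on any given day, or over a series of days?

Initial answer

After this question was asked here, this answer provided the necessary functionality: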

select d.[Date]
      ,count(m.MembershipID) as MembershipCount
from DIM.[Date] as d
    left join Memberships as m
        on(d.[Date] between m.ValidFromDateKey and m.ValidToDateKey)
where d.CalendarYear = 2016
group by d.[Date]
order by d.[Date];

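A commenter remarked, however, that there are other approaches for when the non-equijoin takes too long.

Follow-up

As such, what would equijoin-only logic look like that replicates the output of the query above?


Progress so far

From the answers provided so far I have come up with the following, which performs very well on the hardware I am working with, across 3.2 million Membership records: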

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
)
,e as
(
    select d.[Date] as d
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
),c as
(
    select isnull(s.d,e.d) as d
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
)
select d.[Date]
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
;

Following this, to split the aggregate into its constituent groups for each day, I have the following, which also performs well:

declare @s date = '20160101';
declare @e date = getdate();

with s as
(
    select d.[Date] as d
        ,s.MembershipGrouping as g
        ,count(s.MembershipID) as s
    from dbo.Dates as d
        join dbo.Memberships as s
            on d.[Date] = s.ValidFromDateKey
    group by d.[Date]
            ,s.MembershipGrouping
)
,e as
(
    select d.[Date] as d
        ,e.MembershipGrouping as g
        ,count(e.MembershipID) as e
    from dbo.Dates as d
        join dbo.Memberships as e
            on d.[Date] = e.ValidToDateKey
    group by d.[Date]
            ,e.MembershipGrouping
),c as
(
    select isnull(s.d,e.d) as d
            ,isnull(s.g,e.g) as g
            ,sum(isnull(s.s,0) - isnull(e.e,0)) over (partition by isnull(s.g,e.g) order by isnull(s.d,e.d)) as c
    from s
        full join e
            on s.d = e.d
                and s.g = e.g
)
select d.[Date]
    ,c.g
    ,c.c
from dbo.Dates as d
    left join c
        on d.[Date] = c.d
where d.[Date] between @s and @e
order by d.[Date]
        ,c.g
;

Can anyone improve on the above?

Vla*_*nov 13

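If most of your memberships have a validity period longer than a few days, have a look at the answer by Martin Smith. That approach is likely to be faster.


When you take the calendar table (DIM.[Date]) and left join it to Memberships, you may end up scanning the Memberships table once for each date in the range. Even if there is an index on (ValidFromDate, ValidToDate), it may not be very helpful.

It is easy to turn this around: scan the Memberships table only once and, for each membership, find the dates on which it is valid using CROSS APPLY.

Sample data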

DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo   date = '2006-12-31';

Query 1

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= Memberships.ValidFromDate
            AND dbo.Calendar.dt <= Memberships.ValidToDate
            AND dbo.Calendar.dt >= @RangeFrom
            AND dbo.Calendar.dt <= @RangeTo
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE);

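OPTION(RECOMPILE) is not really needed. I include it in all queries when I compare execution plans, to be sure that I get the latest plan while I experiment with the queries.

When I looked at the plan for this query, I saw that the seek on Calendar.dt used only ValidFromDate and ValidToDate; @RangeFrom and @RangeTo were pushed into the residual predicate. That is not ideal. The optimiser is not smart enough to calculate the maximum of the two dates (ValidFromDate and @RangeFrom) and use that date as the starting point of the seek.

[Screenshot: seek 1]

It is easy to help the optimiser:

Query 2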

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) AS CA
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

In this query the seek is optimal and does not read dates that would be discarded later.

[Screenshot: seek 2]
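
To make the effect of the two CASE expressions concrete, here is a small sketch that prints the seek bounds they produce for each sample membership. The column aliases SeekFrom and SeekTo are names made up for this illustration; the data and the range come from the sample data above.

```sql
-- Sample memberships and range, copied from the sample data above.
DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo   date = '2006-12-31';

SELECT
    MembershipId
    -- Later of the two start dates: where the seek on Calendar.dt begins.
    ,CASE WHEN ValidFromDate > @RangeFrom
        THEN ValidFromDate
        ELSE @RangeFrom END AS SeekFrom
    -- Earlier of the two end dates: where the seek on Calendar.dt stops.
    ,CASE WHEN ValidToDate < @RangeTo
        THEN ValidToDate
        ELSE @RangeTo END AS SeekTo
FROM @T
ORDER BY MembershipId;
```

For membership 1 the bounds come out as 2006-01-01 to 2006-05-09; for memberships 2 and 3 they are the full range, 2006-01-01 to 2006-12-31.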

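Finally, you may not need to scan the whole Memberships table. We need only those rows where the given date range intersects the valid range of the membership.

Query 3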

SELECT
    CA.dt
    ,COUNT(*) AS MembershipCount
FROM
    @T AS Memberships
    CROSS APPLY
    (
        SELECT dbo.Calendar.dt
        FROM dbo.Calendar
        WHERE
            dbo.Calendar.dt >= 
                CASE WHEN Memberships.ValidFromDate > @RangeFrom 
                THEN Memberships.ValidFromDate 
                ELSE @RangeFrom END
            AND dbo.Calendar.dt <= 
                CASE WHEN Memberships.ValidToDate < @RangeTo 
                THEN Memberships.ValidToDate 
                ELSE @RangeTo END
    ) AS CA
WHERE
    Memberships.ValidToDate >= @RangeFrom
    AND Memberships.ValidFromDate <= @RangeTo
GROUP BY
    CA.dt
ORDER BY
    CA.dt
OPTION(RECOMPILE)
;

Two intervals [a1;a2] and [b1;b2] intersect when

a2 >= b1 and a1 <= b2
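
As a quick check of that condition, the sketch below flags which of the sample memberships intersect a given range. The 2010 range and the IntersectsRange alias are chosen only for this illustration; they are not part of the original queries.

```sql
-- Sample memberships, copied from the sample data above.
DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

-- Hypothetical range for illustration: it starts after memberships 1 and 3 have ended.
DECLARE @RangeFrom date = '2010-01-01';
DECLARE @RangeTo   date = '2010-12-31';

SELECT
    MembershipId
    -- a2 >= b1 and a1 <= b2, with a = membership validity and b = requested range
    ,CASE
        WHEN ValidToDate >= @RangeFrom AND ValidFromDate <= @RangeTo
        THEN 1 ELSE 0
     END AS IntersectsRange
FROM @T
ORDER BY MembershipId;
```

With this range only membership 2 is flagged, because the other two ended before 2010-01-01.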

These queries assume that the Calendar table has an index on dt.
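
If you do not have such a table yet, a minimal sketch could look like the following. The table and column names match the queries above; the constraint name, the 1990 to 2029 date range, and the use of sys.all_objects as a row source are my assumptions, so adjust them to your data.

```sql
-- Assumed shape of the calendar table: one row per day, clustered on dt
-- so that range seeks on dt are possible.
CREATE TABLE dbo.Calendar
(
    dt date NOT NULL
    ,CONSTRAINT PK_Calendar PRIMARY KEY CLUSTERED (dt)
);

-- Fill it with every day from 1990-01-01 to 2029-12-31 (an arbitrary range).
-- The cross join of sys.all_objects is only used as a convenient source of rows.
WITH N AS
(
    SELECT TOP (DATEDIFF(DAY, '19900101', '20291231') + 1)
        ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects AS a
        CROSS JOIN sys.all_objects AS b
)
INSERT INTO dbo.Calendar (dt)
SELECT DATEADD(DAY, CAST(n AS int), CAST('19900101' AS date))
FROM N;
```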

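You should experiment to see which indexes work best for the Memberships table. For the last query, if the table is rather large, two separate indexes, one on ValidFromDate and one on ValidToDate, will most likely be better than a single index on (ValidFromDate, ValidToDate).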
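
The two alternatives could be declared roughly as below. The index names are hypothetical, and I am assuming the table is dbo.Memberships as in the question; create one option at a time and compare the plans.

```sql
-- Option A: two single-column indexes (index names are hypothetical).
CREATE NONCLUSTERED INDEX IX_Memberships_ValidFromDate
    ON dbo.Memberships (ValidFromDate);

CREATE NONCLUSTERED INDEX IX_Memberships_ValidToDate
    ON dbo.Memberships (ValidToDate);

-- Option B: one composite index on both columns.
CREATE NONCLUSTERED INDEX IX_Memberships_ValidFrom_ValidTo
    ON dbo.Memberships (ValidFromDate, ValidToDate);
```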

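You should try the different queries and measure their performance on the real hardware with real data. Performance may depend on the data distribution: how many memberships there are, what their valid dates are, how wide or narrow the given range is, and so on.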
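
One simple way to take such measurements, besides looking at the plans, is to switch on the session statistics before running each candidate. The query in the middle is only a stand-in that assumes dbo.Memberships has the columns shown in the question; replace it with whichever query you are testing.

```sql
-- Report logical reads and CPU/elapsed time for each statement in this session.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Stand-in query under test (assumes dbo.Memberships as in the question).
DECLARE @RangeFrom date = '2006-01-01';
DECLARE @RangeTo   date = '2006-12-31';

SELECT COUNT(*) AS IntersectingMemberships
FROM dbo.Memberships AS M
WHERE M.ValidToDate >= @RangeFrom
    AND M.ValidFromDate <= @RangeTo
OPTION (RECOMPILE);

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

In SSMS the figures appear on the Messages tab.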

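I recommend a great tool called SQL Sentry Plan Explorer for analysing and comparing execution plans. It is free. It shows a lot of useful statistics, such as the execution time and number of reads for each query. The screenshots above are from this tool.

  • The performance difference between ROWS and RANGE depends on the version. It was fixed in recent versions (mentioned in this session: https://sqlbits.com/Sessions/Event17/Window_Functions) (2 upvotes)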

Mar*_*ith 6

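Assuming your date dimension contains every date covered by any membership period, you can use something like the following.

The join is an equi join, so it can use a hash join or a merge join, not just nested loops (which would execute the inner subtree once for each outer row).

Assuming an index on (ValidToDate) include(ValidFromDate), or the reverse, this can use a single seek against Memberships and a single scan of the date dimension. For me, the query below takes less than a second to return a year of results against a table with 3.2 million memberships and roughly 1.4 million generally active ones (script).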
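
The index mentioned above could be created along these lines. The index name is hypothetical, and I am assuming the table lives in the dbo schema as in the question.

```sql
-- Seek on ValidToDate >= @StartDate; ValidFromDate is carried in the leaf
-- level so the query does not need lookups into the base table.
CREATE NONCLUSTERED INDEX IX_Memberships_ValidToDate_Incl_ValidFromDate
    ON dbo.Memberships (ValidToDate)
    INCLUDE (ValidFromDate);
```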

DECLARE @StartDate DATE = '2016-01-01',
        @EndDate   DATE = '2016-12-31';

WITH MD
     AS (SELECT Date,
                SUM(Adj) AS MemberDelta
         FROM   Memberships
                CROSS APPLY (VALUES ( ValidFromDate, +1),
                                    --Membership count decremented day after the ValidToDate
                                    (DATEADD(DAY, 1, ValidToDate), -1) ) V(Date, Adj)
         WHERE
          --Members already expired before the time range of interest can be ignored
          ValidToDate >= @StartDate
          AND
          --Members whose membership starts after the time range of interest can be ignored
          ValidFromDate <= @EndDate
         GROUP  BY Date),
     MC
     AS (SELECT DD.DateKey,
                SUM(MemberDelta) OVER (ORDER BY DD.DateKey ROWS UNBOUNDED PRECEDING) AS CountOfNonIgnoredMembers
         FROM   DIM_DATE DD
                LEFT JOIN MD
                  ON MD.Date = DD.DateKey)
SELECT DateKey,
       CountOfNonIgnoredMembers AS MembershipCount
FROM   MC
WHERE  DateKey BETWEEN @StartDate AND @EndDate 
ORDER BY DateKey
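
To see what the MD step produces, here is a stripped-down sketch that runs against the three sample memberships from the question and skips the date dimension and the range filter. The names EventDate and OpenMemberships are made up for this illustration.

```sql
-- Sample memberships from the question.
DECLARE @T TABLE (MembershipId int, ValidFromDate date, ValidToDate date);

INSERT INTO @T VALUES
(1, '1997-01-01', '2006-05-09'),
(2, '1997-01-01', '2017-05-12'),
(3, '2005-06-02', '2009-02-07');

-- Each membership contributes +1 on its start date
-- and -1 on the day after its end date.
WITH MD AS
(
    SELECT V.EventDate
        ,SUM(V.Adj) AS MemberDelta
    FROM @T AS M
        CROSS APPLY (VALUES (M.ValidFromDate, +1),
                            (DATEADD(DAY, 1, M.ValidToDate), -1)) AS V(EventDate, Adj)
    GROUP BY V.EventDate
)
SELECT EventDate
    ,MemberDelta
    -- Running total of the deltas = memberships open from that date onwards.
    ,SUM(MemberDelta) OVER (ORDER BY EventDate ROWS UNBOUNDED PRECEDING) AS OpenMemberships
FROM MD
ORDER BY EventDate;
```

This returns five event dates: 1997-01-01 (delta +2, running total 2), 2005-06-02 (+1, 3), 2006-05-10 (-1, 2), 2009-02-08 (-1, 1) and 2017-05-13 (-1, 0). The full query above joins these deltas to the date dimension so that days without any event still get a row and inherit the running total.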

Demo (using an extended period, as the calendar year 2016 is not very interesting with the example data)
