SQL连接具有一系列值(int范围,日期范围,等等)

Chr*_*ris 4 sql optimization performance query-optimization

我有两个表,第一个是一个大表(数百万行),最有趣的列是一个整数,我只称之为"键".我相信这个解决方案对于日期或日期时间范围也是相同的.

第二个表要小得多(数千行),有一堆我感兴趣的属性,这些属性是在一系列键上定义的.它具有以下结构:

key_lower_bound:int key_upper_bound:int interesting_value1:float interesting_value2:int interesting_value3:varchar(50)...

我想查找第一个表中的所有值,并根据第一个表中的键是否落在区间[key_lower_bound,key_upper_bound]内,将它们与第二个表"连接"起来.

这有点像稀疏内积或稀疏点积在数学上,但它有点奇怪,因为第二个表中涉及这些范围.不过,如果我在代码中写这个,那将是一个O(|第一个表| + |第二个表|)算法.我会指向两个(已排序)列表并逐个浏览它们,以确定第一个表中的每个键是否属于第二个表的范围.诀窍在于,每次检查第一个表中的键时,我都不会遍历第二个列表,因为两个列表都已排序.

当我构造最客观的SQL查询(涉及检查该键是> key_lower_bound和<key_upper_bound)时,它需要花费太长时间.

这个天真的查询会发生某种二次行为,因为我认为查询引擎正在对第二个表中的每一行进行比较,而实际上,如果第二个表是按key_lower_bounds排序的,那么这不是必需的.所以我得到一个O(|第一个表| x​​ |第二个表|)行为而不是所需的O(|第一个表| + |第二个表|)行为.

是否可以获得线性SQL查询来执行此操作?

A-K*_*A-K 6

好吧,我已经解决了这个问题并提出了一些建议.但首先让我们填充帮助表

CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
DECLARE @i INT;
SET @i = 1;
INSERT INTO dbo.Numbers(n) SELECT 1;
WHILE @i<1024000 BEGIN
  INSERT INTO dbo.Numbers(n)
    SELECT n + @i FROM dbo.Numbers;
  SET @i = @i * 2;
END;
GO
Run Code Online (Sandbox Code Playgroud)

和测试数据,一分钟一分钟广告一年,同一年每分钟一个客户电话:

CREATE TABLE dbo.Commercials(
  StartedAt DATETIME NOT NULL 
    CONSTRAINT PK_Commercials PRIMARY KEY,
  EndedAt DATETIME NOT NULL,
  CommercialName VARCHAR(30) NOT NULL);
GO
INSERT INTO dbo.Commercials(StartedAt, EndedAt, CommercialName)
SELECT DATEADD(minute, n - 1, '20080101')
    ,DATEADD(minute, n, '20080101')
    ,'Show #'+CAST(n AS VARCHAR(6))
  FROM dbo.Numbers
  WHERE n<=24*365*60;
GO
CREATE TABLE dbo.Calls(CallID INT 
  CONSTRAINT PK_Calls NOT NULL PRIMARY KEY,
  AirTime DATETIME NOT NULL,
  SomeInfo CHAR(300));
GO
INSERT INTO dbo.Calls(CallID,
  AirTime,
  SomeInfo)
SELECT n 
    ,DATEADD(minute, n - 1, '20080101')
    ,'Call during Commercial #'+CAST(n AS VARCHAR(6))
  FROM dbo.Numbers
  WHERE n<=24*365*60;
GO
CREATE UNIQUE INDEX Calls_AirTime
  ON dbo.Calls(AirTime) INCLUDE(SomeInfo);
GO
Run Code Online (Sandbox Code Playgroud)

原来尝试在一年中的商业广告中选择所有拨打电话的时间非常缓慢:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 30 ms.

(1 row(s) affected)
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 2, logical reads 3338264, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 2, logical reads 7166, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 71704 ms,  elapsed time = 36316 ms.
Run Code Online (Sandbox Code Playgroud)

原因很简单:我们知道商业广告不会重叠,因此一个广告最多只适用于一个广告,但优化商并不知道.我们知道广告很短,但优化者也不知道.这两个假设都可以作为约束强制执行,但优化器不会不会.

假设广告不超过15分钟,我们可以告诉优化器,查询速度非常快:

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Commercials s JOIN dbo.Calls c 
  ON c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
AND s.StartedAt BETWEEN '20080630 23:45' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time: 
   CPU time = 15 ms, elapsed time = 15 ms.

(1 row(s) affected)
Table 'Worktable'. Scan count 1, logical reads 753, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Commercials'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 24 ms.
Run Code Online (Sandbox Code Playgroud)

假设广告没有重叠,所以一个调用最多适合一个广告,我们可以告诉优化器,查询再次非常快:

SELECT COUNT(*) FROM(
SELECT s.StartedAt, s.EndedAt, c.AirTime
FROM dbo.Calls c CROSS APPLY(
  SELECT TOP 1 s.StartedAt, s.EndedAt FROM dbo.Commercials s 
  WHERE c.AirTime >= s.StartedAt AND c.AirTime < s.EndedAt
  ORDER BY s.StartedAt DESC) AS s
WHERE c.AirTime BETWEEN '20080701' AND '20080701 03:00'
) AS t;

SQL Server parse and compile time: 
   CPU time = 0 ms, elapsed time = 7 ms.

(1 row(s) affected)
Table 'Commercials'. Scan count 181, logical reads 1327, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Calls'. Scan count 1, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 31 ms.
Run Code Online (Sandbox Code Playgroud)