TOP(1) BY GROUP on a very large (100,000,000+) table

Ste*_*old 8 index sql-server partitioning azure-sql-database

Setup

I have a large table with about 115,382,254 rows. The table is relatively simple and logs application process operations.

CREATE TABLE [data].[OperationData](
    [SourceDeciveID] [bigint] NOT NULL,
    [FileSource] [nvarchar](256) NOT NULL,
    [Size] [bigint] NULL,
    [Begin] [datetime2](7) NULL,
    [End] [datetime2](7) NOT NULL,
    [Date]  AS (isnull(CONVERT([date],[End]),CONVERT([date],'19000101',(112)))) PERSISTED NOT NULL,
    [DataSetCount] [bigint] NULL,
    [Result] [int] NULL,
    [Error] [nvarchar](max) NULL,
    [Status] [int] NULL,
 CONSTRAINT [PK_OperationData] PRIMARY KEY CLUSTERED 
(
    [SourceDeciveID] ASC,
    [FileSource] ASC,
    [End] ASC
))

CREATE TABLE [model].[SourceDevice](
    [ID] [bigint] IDENTITY(1,1) NOT NULL,
    [Name] [nvarchar](50) NULL,
 CONSTRAINT [PK_DataLogger] PRIMARY KEY CLUSTERED 
(
    [ID] ASC
))

ALTER TABLE [data].[OperationData]  WITH CHECK ADD  CONSTRAINT [FK_OperationData_SourceDevice] FOREIGN KEY([SourceDeciveID])
REFERENCES [model].[SourceDevice] ([ID])

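The table is partitioned by day into about 500 partitions.

Partitions

[Image: partition overview]

The table is also well indexed by its PK, statistics are up to date, and the indexes are defragmented every night.

Index-based SELECTs are lightning fast, and we have no problems with them.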

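Problem

I need to know the last (TOP) row by [End] for each [SourceDeciveID], that is, the latest [OperationData] row for every source device.

I need to find a way to solve this cleanly, without pushing the database to its limits.

Attempt 1

The first attempt was the obvious GROUP BY or SELECT ... OVER (PARTITION BY ...) query. The problem here is just as obvious: every query has to scan every partition and sort it to find the top row. So the queries are very slow and have a heavy IO impact.

Example query 1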

;WITH cte AS
(
   SELECT *,
         ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID] ORDER BY [End] DESC) AS rn
   FROM [data].[OperationData]
)
SELECT *
FROM cte
WHERE rn = 1

Example query 2

SELECT *
FROM [data].[OperationData] AS d 
CROSS APPLY 
(
   SELECT TOP 1 *
   FROM [data].[OperationData] 
   WHERE [SourceDeciveID] = d.[SourceDeciveID]
   ORDER BY [End] DESC
) AS ds

Failed!

Attempt 2

I created a helper table that always holds a reference to the TOP row.

CREATE TABLE [data].[LastOperationData](
    [SourceDeciveID] [bigint] NOT NULL,
    [FileSource] [nvarchar](256) NOT NULL,
    [End] [datetime2](7) NOT NULL,
 CONSTRAINT [PK_LastOperationData] PRIMARY KEY CLUSTERED 
(
    [SourceDeciveID] ASC
))

ALTER TABLE [data].[LastOperationData]  WITH CHECK ADD  CONSTRAINT [FK_LastOperationData_OperationData] FOREIGN KEY([SourceDeciveID], [FileSource], [End])
REFERENCES [data].[OperationData] ([SourceDeciveID], [FileSource], [End])

To fill the table, I created a trigger that adds or updates the row for a source device whenever a row with a higher [End] is inserted.

CREATE TRIGGER [data].[OperationData_Last]
   ON  [data].[OperationData]
   AFTER INSERT
AS 
BEGIN
    SET NOCOUNT ON;

    MERGE [data].[LastOperationData] AS [target]
    USING (SELECT [SourceDeciveID], [FileSource], [End] FROM inserted) AS [source] ([SourceDeciveID], [FileSource], [End])  
    ON ([target].[SourceDeciveID] = [source].[SourceDeciveID])

    WHEN MATCHED AND [target].[End] < [source].[End] THEN
        UPDATE SET [target].[FileSource] = source.[FileSource], [target].[End] = source.[End]

    WHEN NOT MATCHED THEN  
        INSERT ([SourceDeciveID], [FileSource], [End])  
        VALUES (source.[SourceDeciveID], source.[FileSource], source.[End]);

END
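A side note on the trigger itself, separate from the IO question: MERGE raises an error when several source rows match the same target row, which happens as soon as one INSERT statement carries more than one row for the same device. A minimal sketch that first reduces `inserted` to the newest row per device is shown here; the CTE name `latest` and the `rn` column are introduced only for illustration, and `CREATE OR ALTER` assumes Azure SQL Database or SQL Server 2016 SP1 and later.

```sql
-- Sketch only: same trigger logic, but `inserted` is reduced to one row per device first.
CREATE OR ALTER TRIGGER [data].[OperationData_Last]
   ON [data].[OperationData]
   AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- Keep only the row with the highest [End] per device (hypothetical CTE name).
    WITH [latest] AS
    (
        SELECT [SourceDeciveID], [FileSource], [End],
               ROW_NUMBER() OVER (PARTITION BY [SourceDeciveID]
                                  ORDER BY [End] DESC) AS rn
        FROM inserted
    )
    MERGE [data].[LastOperationData] AS [target]
    USING (SELECT [SourceDeciveID], [FileSource], [End]
           FROM [latest]
           WHERE rn = 1) AS [source]
    ON ([target].[SourceDeciveID] = [source].[SourceDeciveID])

    -- Only move the pointer forward in time.
    WHEN MATCHED AND [target].[End] < [source].[End] THEN
        UPDATE SET [target].[FileSource] = [source].[FileSource],
                   [target].[End] = [source].[End]

    WHEN NOT MATCHED THEN
        INSERT ([SourceDeciveID], [FileSource], [End])
        VALUES ([source].[SourceDeciveID], [source].[FileSource], [source].[End]);
END
```

This does not by itself explain or remove a scan of [OperationData]; it only makes the trigger safe for multi-row inserts.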

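The problem with this approach is that it also has a very large IO impact, and I don't know why.

As you can see in the query plan, it also performs a scan of the whole [OperationData] table.

It has a huge overall impact on my database. [Image: statistics]

Failed!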

Rob*_*ley 9

If you have a table of SourceID values, and an index on your main table on (SourceID, End) include (othercolumns), then just use OUTER APPLY.

SELECT d.*
FROM dbo.Sources s
OUTER APPLY (SELECT TOP (1) *
    FROM data.OperationData d
    WHERE d.SourceID = s.SourceID
    ORDER BY d.[End] DESC) d;
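As a rough sketch of the index described above, written against the column names from the question's table: the index name and the INCLUDE list are assumptions for illustration, not something stated in the question.

```sql
-- Hypothetical index name and INCLUDE list; adjust both to the columns the query returns.
CREATE NONCLUSTERED INDEX [IX_OperationData_SourceDeciveID_End]
    ON [data].[OperationData] ([SourceDeciveID] ASC, [End] DESC)
    INCLUDE ([Size], [Result], [Status]);
```

Because the query takes only TOP (1) per source, a narrow index is likely enough even with SELECT *: any columns the index does not cover cost roughly one lookup per source rather than one per row.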

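If you know you're only after your newest partition, you can include a filter on End, like AND d.[End] > DATEADD(day, -1, GETDATE())

Edit: Since your clustered index is on (SourceID, Source, End), put Source in your Sources table as well and join on that too. Then you won't need a new index.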

SELECT d.*
FROM dbo.Sources s -- Small table
OUTER APPLY (SELECT TOP (1) *
    FROM data.OperationData d -- Big table quick seeks
    WHERE d.SourceID = s.SourceID
    AND d.Source = s.Source
    AND d.[End] > DATEADD(day, -1, GETDATE()) -- If you’re partitioning on [End], do this for partition elimination
    ORDER BY d.[End] DESC) d;
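Mapped onto the schema in the question, the small driver table would presumably be [model].[SourceDevice] and the key column [SourceDeciveID]. The following is a sketch under that assumption. It relies on an index leading on ([SourceDeciveID], [End]), and the selected columns are chosen only for illustration.

```sql
-- Sketch only: latest operation per device, driven from the small device table.
SELECT s.[ID] AS [SourceDeciveID],
       d.[FileSource],
       d.[End],
       d.[Result],
       d.[Status]
FROM [model].[SourceDevice] AS s              -- small table: one row per device
OUTER APPLY
(
    SELECT TOP (1) o.[FileSource], o.[End], o.[Result], o.[Status]
    FROM [data].[OperationData] AS o          -- big table: one seek per device
    WHERE o.[SourceDeciveID] = s.[ID]
    ORDER BY o.[End] DESC
) AS d;
```

OUTER APPLY keeps devices that have no operations at all (their d columns come back NULL); switching to CROSS APPLY drops those devices from the result.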