Ted*_*een 6 sql-server time-series columnstore azure-sql-database
I have a clustered columnstore index table for my IoT metrics (time-series data). It contains more than 1 billion rows and has the following structure:
CREATE TABLE [dbo].[Data](
[DeviceId] [bigint] NOT NULL,
[MetricId] [smallint] NOT NULL,
[TimeStamp] [datetime2](2) NOT NULL,
[Value] [real] NOT NULL
)
CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([TimeStamp],[DeviceId],[MetricId]) --WITH (DROP_EXISTING = ON)
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data] WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
There are about 10,000 distinct DeviceId values, and the TimeStamps range from 2008 to the present. A typical query against this table looks like this:
SET STATISTICS TIME, IO ON
SELECT
[DeviceId]
,[MetricId]
,DATEADD(day, DATEDIFF(day, '2005-01-01', [TimeStamp]), '2005-01-01') As [Date]
,MIN([Value]) as [Min]
,MAX([Value]) as [Max]
,AVG([Value]) as [Avg]
,SUM([Value]) as [Sum]
,COUNT([Value]) as [Count]
FROM
[dbo].[Data]
WHERE
[DeviceId] = 6077129891325167032
AND [MetricId] = 1000
AND [TimeStamp] BETWEEN '2017-07-01' AND '2017-07-30'
GROUP BY
[DeviceId]
,[MetricId]
,DATEDIFF(day, '2005-01-01', [TimeStamp])
ORDER BY
[DeviceId]
,[MetricId]
,DATEDIFF(day, '2005-01-01', [TimeStamp])
When I execute this query, I get the following performance metrics. I believe the query as written above currently does too many segment reads:
Table 'Data'. Scan count 2, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 5257, lob physical reads 9, lob read-ahead reads 4000.
Table 'Data'. Segment reads 11, segment skipped 764.
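I don't think this is well optimized, since it takes 11 segment reads to retrieve only 212 of the 1 billion source rows (before grouping/aggregation).
I then ran Niko Neugebauer's excellent script to verify our setup and columnstore alignment (https://github.com/NikoNeugebauer/CISL/blob/master/Azure/alignment.sql), and I got this result after rebuilding the clustered columnstore index:
The MetricId and TimeStamp columns have an optimal alignment score of 100%. How can we make sure that the DeviceId column is well aligned too? I played with the column order in the initial clustered (rowstore) index; is that where this can be optimized?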
小智 8
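The key to aligning the table by DeviceId is to build a clustered rowstore index on the table and then build a clustered columnstore index over it with MAXDOP = 1 (so that no overlap is introduced, which happens when the index build runs on multiple cores). So possible code would look like this: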
CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([DeviceId],[TimeStamp],[MetricId]) --WITH (DROP_EXISTING = ON)
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data] WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
Another possibility is to do it all within CISL by preparing and then executing the alignment function:
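To check whether a rebuild like the one above actually reduced overlap on DeviceId, one option is to look at the per-segment metadata directly. The following is only a sketch against the `sys.column_store_segments` catalog view; it assumes the table is `dbo.Data` as in the question. How `min_data_id` and `max_data_id` map to raw column values depends on the segment's encoding, so treat the output as an indication of how much the segment ranges overlap rather than as exact DeviceId values.

```sql
-- Sketch: list per-segment min/max metadata for the DeviceId column of dbo.Data.
-- Well-aligned segments show ranges that barely overlap from one segment to the next.
SELECT
    p.partition_number,
    s.segment_id,
    s.encoding_type,
    s.row_count,
    s.min_data_id,
    s.max_data_id
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p
    ON p.hobt_id = s.hobt_id
JOIN sys.columns AS c
    ON c.object_id = p.object_id
   AND c.column_id = s.column_id
WHERE p.object_id = OBJECT_ID(N'dbo.Data')
  AND c.name = N'DeviceId'
ORDER BY p.partition_number, s.segment_id;
```

If many segments report nearly the same wide min/max range, a filter on a single DeviceId cannot skip them, which would be consistent with the segment reads shown in the question.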
insert into dbo.cstore_Clustering( TableName, Partition, ColumnName )
VALUES ('[dbo].[Data]', 1, 'DeviceId' );
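This covers only 1 partition, but once you reach the row counts you are working with, you should consider partitioning your table. Once the setup is done, you can execute dbo.cstore_doAlignment, which will automatically re-align and optimize your table. (If you wish, there are some parameters for configuring the optimization thresholds.)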
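As a rough illustration of the partitioning suggestion, the sketch below partitions the table by year on TimeStamp and then repeats the rowstore-then-columnstore rebuild on the partition scheme. The names `pfDataByYear` and `psDataByYear` and the boundary values are hypothetical, not from the original post, and the sketch assumes `PK_Data` already exists as the clustered index on `dbo.Data`.

```sql
-- Hypothetical names and boundaries; extend the value list to cover the real data range.
CREATE PARTITION FUNCTION pfDataByYear (datetime2(2))
AS RANGE RIGHT FOR VALUES ('2016-01-01', '2017-01-01', '2018-01-01');

-- Azure SQL Database only offers the PRIMARY filegroup.
CREATE PARTITION SCHEME psDataByYear
AS PARTITION pfDataByYear ALL TO ([PRIMARY]);

-- Step 1: rebuild as a rowstore clustered index on the scheme,
-- so each partition is physically sorted by DeviceId first.
CREATE CLUSTERED INDEX [PK_Data] ON [dbo].[Data] ([DeviceId], [TimeStamp], [MetricId])
WITH (DROP_EXISTING = ON)
ON psDataByYear ([TimeStamp]);

-- Step 2: convert back to columnstore single-threaded to preserve that order.
CREATE CLUSTERED COLUMNSTORE INDEX [PK_Data] ON [dbo].[Data]
WITH (DROP_EXISTING = ON, MAXDOP = 1, DATA_COMPRESSION = COLUMNSTORE_ARCHIVE)
ON psDataByYear ([TimeStamp]);
```

With this layout, a date-range predicate can eliminate whole partitions, and the DeviceId ordering inside each partition is what allows segments to be skipped.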
Best regards, Niko