Jan*_*ert · sql-server · clustered-index · columnstore · sql-server-2014 · index-maintenance
We use clustered columnstore indexes with partitioning on SQL Server 2014. Some tables receive many updates, and we have found that certain partitions contain a large number of deleted rows. In our tests, a rebuild reduced the table size to one third and improved performance by about 10%.
Besides checking deleted_rows in sys.column_store_row_groups, are there other columnstore fragmentation metrics we can use to decide which partitions/tables should be rebuilt?
First, you may want to apply the COMPRESSION_DELAY option to these columnstore indexes to reduce fragmentation: deleting compressed rows causes heavy fragmentation, and the storage is still consumed even after the rows are deleted, which eventually hurts performance.
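As a minimal sketch of that option (note that COMPRESSION_DELAY is documented from SQL Server 2016 onward, so it would not help on the asker's 2014 instance; the table and index names below are hypothetical):

```sql
-- COMPRESSION_DELAY keeps newly inserted rows in the delta store for the
-- given number of minutes before they are compressed, so rows that are
-- updated or deleted shortly after insert never reach a compressed
-- rowgroup and don't leave deleted-row "ghosts" behind.
-- dbo.FactSales and CCI_FactSales are hypothetical names.
ALTER INDEX CCI_FactSales ON dbo.FactSales
SET (COMPRESSION_DELAY = 60);  -- minutes
```

Pick a delay that covers the window in which a row is typically still "hot" in your workload; rows older than that are compressed as usual.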
Further reading:
While the desired functionality is still an open item in most of the recommended maintenance solutions, you may want to start with the following script (copied from the source):
/*
Rebuild index statement is printed at partition level if
a. RGQualityMeasure is not met for @PercentageRGQualityPassed Rowgroups
-- this is an arbitrary number, what we are saying is that if the average is above this number, don't bother rebuilding as we consider this number to be good quality rowgroups
b. Second constraint is deleted rows; the default set here is 10% of the partition itself. If the partition is very large or very small, consider adjusting this
c. In SQL 2014, after an index rebuild, the DMV doesn't show why a rowgroup was trimmed to < 1 million rows
- If the Dictionary is full (16MB) then no use in rebuilding this rowgroup as even after rebuild it may get trimmed
- If dictionary is full only rebuild if deleted rows falls above the threshold
*/
if object_id('tempdb..#temp') IS NOT NULL
drop table #temp
go
Declare @DeletedRowsPercent Decimal (5,2)
-- Debug = 1 if you need all rowgroup information regardless
Declare @Debug int = 0
-- Percent of deleted rows for the partition
Set @DeletedRowsPercent = 10
-- RGQuality: anything over 500K rows compressed is considered good rowgroup quality; anything less needs re-evaluation.
Declare @RGQuality int = 500000
-- means 20% of rowgroups are < @RGQuality from the rows/rowgroup perspective
Declare @PercentageRGQualityPassed smallint = 20
;WITH CSAnalysis
AS
(SELECT object_id,object_name(object_id) as TableName,
index_id,
rg.partition_number,
count(*) as CountRGs,
sum(total_rows) as TotalRows,
Avg(total_rows) as AvgRowsPerRG,
SUM(CASE WHEN rg.Total_Rows < @RGQuality THEN 1 ELSE 0 END) as CountRGLessThanQualityMeasure,
@RGQuality as RGQualityMeasure,
cast((SUM(CASE
WHEN rg.Total_Rows < @RGQuality THEN 1.0 ELSE 0
END) / count(*) * 100) as Decimal(5,2)) as PercentageRGLessThanQualityMeasure,
Sum(rg.deleted_rows * 1.0) / sum(rg.total_rows * 1.0) * 100 as DeletedRowsPercent,
sum (case when rg.deleted_rows > 0 then 1 else 0 end ) as NumRowgroupsWithDeletedRows
FROM sys.column_store_row_groups rg
where rg.state = 3
group by rg.object_id, rg.partition_number,index_id
),
CSDictionaries
AS
( select max(dict.on_disk_size) as maxdictionarysize
,max(dict.entry_count) as maxdictionaryentrycount
,max(partition_number) as maxpartition_number
,part.object_id
,part.partition_number
from sys.column_store_dictionaries dict
join sys.partitions part
on dict.hobt_id = part.hobt_id
group by part.object_id, part.partition_number
)
select a.*,
b.maxdictionarysize,
b.maxdictionaryentrycount,
maxpartition_number
into #temp
from CSAnalysis a
inner join CSDictionaries b
on a.object_id = b.object_id
and a.partition_number = b.partition_number
-- MAXDOP hint optionally added to ensure we don't spread a small number of rows across many threads.
-- If we do that, we may end up with smaller rowgroups anyway.
declare @maxdophint smallint,
@effectivedop smallint;
-- True if running from the same context that will run the rebuild index.
select @effectivedop = effective_max_dop
from sys.dm_resource_governor_workload_groups
where group_id in (select group_id from sys.dm_exec_requests where session_id = @@spid)
-- Get the Alter Index Statements.
select 'Alter INDEX ' + QuoteName(IndexName) + ' ON ' + QuoteName(TableName) + ' REBUILD ' +
Case
when maxpartition_number = 1 THEN ' '
else ' PARTITION = ' + cast(partition_number as varchar(10))
End
+ ' WITH (MAXDOP =' + cast((Case WHEN (TotalRows*1.0/1048576) < 1.0 THEN 1 WHEN (TotalRows*1.0/1048576) < @effectivedop THEN FLOOR(TotalRows*1.0/1048576) ELSE 0 END) as varchar(10)) + ')'
as Command
from #temp a
inner join
( select object_id,index_id,Name as IndexName from sys.indexes
where type in (5,6) -- non clustered columnstore and clustered columnstore
) as b
on b.object_id = a.object_id and a.index_id = b.index_id
where (DeletedRowsPercent >= @DeletedRowsPercent)
-- Rowgroup Quality trigger, percentage less than rowgroup quality as long as dictionary is not full
OR ( ( ( AvgRowsPerRG < @RGQuality and TotalRows > @RGQuality) AND PercentageRGLessThanQualityMeasure>= @PercentageRGQualityPassed)
AND maxdictionarysize < ( 16*1000*1000)) -- DictionaryNotFull, lower threshold than 16MB.
order by TableName,a.index_id,a.partition_number
-- Debug print if needed
if @Debug=1
Select getdate() as DiagnosticsRunTime, *
from #temp
order by TableName, index_id, partition_number
else
Select getdate() as DiagnosticsRunTime,*
from #temp
-- Deleted rows trigger
where (DeletedRowsPercent >= @DeletedRowsPercent)
-- Rowgroup Quality trigger, percentage less than rowgroup quality as long as dictionary is not full
OR ( ( ( AvgRowsPerRG < @RGQuality and TotalRows > @RGQuality) AND PercentageRGLessThanQualityMeasure>= @PercentageRGQualityPassed)
AND maxdictionarysize < ( 16*1000*1000)
) -- DictionaryNotFull, lower threshold than 16MB.
order by TableName,index_id,partition_number
-- Add logic to actually run those statements
My production experience with columnstore indexes is on SQL Server 2016 and later, but I believe everything in this answer applies to SQL Server 2014 as well. The simplest answer is that you can look at the ratio of rows to space used per partition, if you'd rather not use the sys.column_store_row_groups DMV. The more complicated answer is that columnstore indexes can suffer several different kinds of fragmentation. Which matters most to you depends on your data and workload.
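If you do use the DMV, a compact per-partition summary might look like the sketch below (it assumes nothing beyond the sys.column_store_row_groups columns already used in the accepted script; state 3 means COMPRESSED):

```sql
-- Per-partition deleted-row ratio from sys.column_store_row_groups.
-- High DeletedRowsPercent marks rebuild candidates.
SELECT object_name(rg.object_id) AS TableName,
       rg.partition_number,
       SUM(rg.total_rows)   AS TotalRows,
       SUM(rg.deleted_rows) AS DeletedRows,
       CAST(100.0 * SUM(rg.deleted_rows)
            / NULLIF(SUM(rg.total_rows), 0) AS decimal(5,2))
           AS DeletedRowsPercent
FROM sys.column_store_row_groups rg
WHERE rg.state = 3  -- compressed rowgroups only
GROUP BY rg.object_id, rg.partition_number
ORDER BY DeletedRowsPercent DESC;
```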
Deleted rows in compressed rowgroups - these are the rows you mention in the question. Deleted rows take up space and serve no purpose.
Rows in delta rowgroups - these are rows that have not been compressed (yet). If you end up with too many of them, you will see an impact on query performance.
Compressed rowgroups below the maximum size (1,048,576 rows) - for various reasons you can end up with compressed rowgroups holding fewer than 1,048,576 rows. Microsoft claims that tables generally compress best when rowgroups are maxed out. In my experience it depends on the data; I saw no measurable difference for our data model.
Unaligned segments - you may be loading data so as to get highly selective rowgroup elimination on a key column. If the data ends up out of the preferred order, for example after a maintenance operation, you can treat that as fragmentation too.
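For the last kind, segment alignment can be inspected through sys.column_store_segments. A rough sketch (the table name and column_id are hypothetical; min_data_id/max_data_id are encoded ids rather than raw values, but heavily overlapping ranges across segments still indicate poor alignment and weak rowgroup elimination on that column):

```sql
-- Inspect per-segment min/max metadata for one column.
-- Non-overlapping [min_data_id, max_data_id] ranges across segments
-- mean SQL Server can skip whole rowgroups for range predicates.
SELECT p.partition_number,
       s.segment_id,
       s.min_data_id,
       s.max_data_id
FROM sys.column_store_segments s
JOIN sys.partitions p
  ON s.hobt_id = p.hobt_id
WHERE p.object_id = object_id('dbo.FactSales')  -- hypothetical table
  AND s.column_id = 1                           -- the key column of interest
ORDER BY p.partition_number, s.segment_id;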