SQL Server 2016 SP1 CU3 clears the plan cache

Grz*_*rzO 2 sql-server plan-cache sql-server-2016

At intervals ranging from a few minutes to about an hour, the procedure cache gets flushed (but not completely!). If I run:

SELECT count (*) FROM sys.dm_exec_cached_plans

...right after a flush, the number of plans drops to a few hundred, then gradually climbs to about 2,000, then gets flushed again, and so on.
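Because the flush is only partial, a breakdown of the cache by plan type may show which kind of plan survives and which disappears. This is a minimal sketch, not part of the original question; it uses only columns documented for `sys.dm_exec_cached_plans`, and comparing one run taken just before a flush with one taken just after is my suggestion.

```sql
-- Count and size of cached plans per object type (Adhoc, Prepared, Proc, ...).
-- Run once shortly before and once shortly after a flush, then compare.
SELECT
    objtype,
    cacheobjtype,
    COUNT(*)                                      AS plan_count,
    SUM(CAST(size_in_bytes AS bigint)) / 1048576  AS size_mb,
    SUM(CAST(usecounts AS bigint))                AS total_use_count
FROM sys.dm_exec_cached_plans
GROUP BY objtype, cacheobjtype
ORDER BY plan_count DESC;
```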

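The server runs on VMware and has 128 GB of RAM (SQL Server max server memory is set to 102 GB and min server memory to 72 GB). According to SentryOne, the buffer pool consumes ~61 GB. I don't see any indicators of memory pressure.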

(Screenshot: SentryOne SQL Server memory usage)
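For readers who want to check memory pressure without a monitoring tool, the memory DMVs report roughly the same signals. The sketch below is an illustration I am adding, not something the poster ran; which values count as "pressure" is a judgment call, but `process_physical_memory_low = 1` or a low-memory `system_memory_state_desc` would be worth a closer look.

```sql
-- Operating system view: how much physical memory is free, and the OS memory state.
SELECT
    total_physical_memory_kb     / 1024 AS total_physical_mb,
    available_physical_memory_kb / 1024 AS available_physical_mb,
    system_memory_state_desc
FROM sys.dm_os_sys_memory;

-- SQL Server process view: 1 in either *_low column means the process
-- is responding to a low-memory notification.
SELECT
    physical_memory_in_use_kb / 1024 AS sql_physical_in_use_mb,
    memory_utilization_percentage,
    process_physical_memory_low,
    process_virtual_memory_low
FROM sys.dm_os_process_memory;
```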

I also observed that when I reduced min server memory to 16 GB, the procedure cache was flushed far more often.
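One way to put numbers on that observation is to watch the size of the plan cache stores next to the buffer pool while the setting changes. The following is a sketch under the assumption that the standard clerk names apply on this build (`CACHESTORE_SQLCP` for ad hoc and prepared plans, `CACHESTORE_OBJCP` for procedures and triggers, `MEMORYCLERK_SQLBUFFERPOOL` for the buffer pool).

```sql
-- Memory held by the plan cache stores and by the buffer pool.
SELECT
    type,
    name,
    SUM(pages_kb) / 1024 AS size_mb
FROM sys.dm_os_memory_clerks
WHERE type IN (N'CACHESTORE_SQLCP',          -- ad hoc and prepared plans
               N'CACHESTORE_OBJCP',          -- stored procedures, triggers, functions
               N'MEMORYCLERK_SQLBUFFERPOOL') -- buffer pool
GROUP BY type, name
ORDER BY size_mb DESC;

-- The memory settings currently in effect.
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN (N'min server memory (MB)', N'max server memory (MB)');
```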

My SQL Server version is:

Microsoft SQL Server 2016 (SP1-CU3) (KB4019916) - 13.0.4435.0 (X64)

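What else:

  • I haven't tried LPIM yet because it requires a SQL Server restart, but even if it turns out to be the solution, I'd like to understand why this is happening. Also, Brent suggests that it only helps when the buffer pool is being trimmed. In my case the buffer pool looks intact and only the proc cache is affected.
  • I haven't found any job that could be responsible for this, nor any real pattern.
  • I have access to vCenter, in case anyone finds that useful for fixing or investigating this. If so, please tell me what I should check there.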

I suspect VMware may play some role here, but when I checked the perfmon counters available on the guest side, I found nothing suspicious.

(Screenshot: VM memory performance counters)

I have also collected the sp_Blitz output, as Brent suggested:

**Priority 10: Performance**:

- Query Store Disabled - The new SQL Server 2016 Query Store feature has not been enabled on this database.

    * Database1

    * Database2

    * distribution

    * Database3

    * Database3Integration

    * Database3IntegrationSandBox

    * Database4

    * Database5

    * Database6

    * Database6Sandbox

    * Database7

    * Database8

    * Database9


**Priority 50: Performance**:

- Poison Wait Detected: Serializable Locking  - 0:04:49:23 of LCK_M_R% waits have been recorded. This wait often indicates killer performance problems.


**Priority 50: Reliability**:

- Transaction Log Larger than Data File
    * Database2 - The database [Database2] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.

    * Database4 - The database [Database4] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.

    * Database8 - The database [Database8] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.

    * Database9 - The database [Database9] has a 1 GB transaction log file, larger than the total data file sizes. This may indicate that transaction log backups are not being performed or not performed often enough.


**Priority 100: Performance**:

- Fill Factor Changed Database3 - The [Database3] database has objects with fill factor < 80%. This can cause memory and storage performance problems, but may also prevent page splits.

- Many Plans for One Query  - 180 plans are present for a single query in the plan cache - meaning we probably have parameterization issues.

- Stored Procedure WITH RECOMPILE distribution - [distribution].[dbo].[sp_MSdelete_dodelete] has WITH RECOMPILE in the stored procedure code, which may cause increased CPU usage due to constant recompiles of the code.


**Priority 110: Performance**:

- Active Tables Without Clustered Indexes
    * Database3 - The [Database3] database has heaps - tables without a clustered index - that are being actively queried.

    * Database3Integration - The [Database3Integration] database has heaps - tables without a clustered index - that are being actively queried.

    * Database3IntegrationSandBox - The [Database3IntegrationSandBox] database has heaps - tables without a clustered index - that are being actively queried.

    * msdb - The [msdb] database has heaps - tables without a clustered index - that are being actively queried.

    * Database6 - The [Database6] database has heaps - tables without a clustered index - that are being actively queried.

    * Database6Sandbox - The [Database6Sandbox] database has heaps - tables without a clustered index - that are being actively queried.

    * Database9 - The [Database9] database has heaps - tables without a clustered index - that are being actively queried.


**Priority 150: Performance**:

- Deadlocks Happening Daily  - 116 deadlocks have been recorded since startup.

- Foreign Keys Not Trusted
    * Database3 - The [Database3] database has foreign keys that were probably disabled, data was changed, and then the key was enabled again.  Simply enabling the key is not enough for the optimizer to use this key - we have to alter the table using the WITH CHECK CHECK CONSTRAINT parameter.

    * Database6 - The [Database6] database has foreign keys that were probably disabled, data was changed, and then the key was enabled again.  Simply enabling the key is not enough for the optimizer to use this key - we have to alter the table using the WITH CHECK CHECK CONSTRAINT parameter.

- Inactive Tables Without Clustered Indexes
    * distribution - The [distribution] database has heaps - tables without a clustered index - that have not been queried since the last restart.  These may be backup tables carelessly left behind.

    * Database3 - The [Database3] database has heaps - tables without a clustered index - that have not been queried since the last restart.  These may be backup tables carelessly left behind.

    * msdb - The [msdb] database has heaps - tables without a clustered index - that have not been queried since the last restart.  These may be backup tables carelessly left behind.

- Queries Forcing Join Hints  - 1106 instances of join hinting have been recorded since restart.  This means queries are bossing the SQL Server optimizer around, and if they don't know what they're doing, this can cause more harm than good.  This can also explain why DBA tuning efforts aren't working.

- Queries Forcing Order Hints  - 1106 instances of order hinting have been recorded since restart.  This means queries are bossing the SQL Server optimizer around, and if they don't know what they're doing, this can cause more harm than good.  This can also explain why DBA tuning efforts aren't working.

- Triggers on Tables
    * Database3 - The [Database3] database has 2 triggers.

    * Database9 - The [Database9] database has 2 triggers.


**Priority 170: Reliability**:

- Max File Size Set
    * Database3 - The [Database3] database file Database3Prod_log has a max file size set to 40960MB. If it runs out of space, the database will stop working even though there may be drive space available.

    * Database8 - The [Database8] database file db263426220 has a max file size set to 1024MB. If it runs out of space, the database will stop working even though there may be drive space available.

    * Database9 - The [Database9] database file Database9_log has a max file size set to 10000MB. If it runs out of space, the database will stop working even though there may be drive space available.


**Priority 200: Backup**:

- MSDB Backup History Not Purged msdb - Database backup history retained back to Feb 27 2017  3:48PM


**Priority 200: Informational**:

- Agent Jobs Starting Simultaneously  - Multiple SQL Server Agent jobs are configured to start simultaneously. For detailed schedule listings, see the query in the URL.

- Cluster Node  - This is a node in a cluster.

- Collation is SQL_Latin1_General_CP1_CI_AS - Collation differences between user databases and tempdb can cause conflicts especially when comparing string values

    * Database3Integration

    * Database3IntegrationSandBox

    * Database7

    * Database8

- Replication In Use
    * distribution - Database [distribution] is a replication publisher, subscriber, or distributor.

    * Database3 - Database [Database3] is a replication publisher, subscriber, or distributor.

- Tables in the MSDB Database
 - The SQLSentryAlertLog_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryDBEmails_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryDBEmails_Attachments_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryEmails_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryLogCache_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryLogData_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryObjectVersion_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

 - The SQLSentryQueueLog_20 table in the msdb database was created by end users on Mar 10 2017  4:40PM. Tables in the msdb database may not be restored in the event of a disaster.

- TraceFlag On  - Trace flag 3226 is enabled globally.


**Priority 200: Monitoring**:

- Agent Jobs Without Failure Emails
 - The job Agent history clean up: distribution has not been set up to notify an operator if it fails.

 - The job Database1_Job_DeleteExpiredSessions has not been set up to notify an operator if it fails.

 - The job CommandLog Cleanup has not been set up to notify an operator if it fails.

 - The job DatabaseBackup - SYSTEM_DATABASES - FULL has not been set up to notify an operator if it fails.

 - The job DatabaseBackup - USER_DATABASES - DIFF has not been set up to notify an operator if it fails.

 - The job DatabaseBackup - USER_DATABASES - FULL has not been set up to notify an operator if it fails.

 - The job DatabaseBackup - USER_DATABASES - LOG has not been set up to notify an operator if it fails.

 - The job DatabaseIntegrityCheck - SYSTEM_DATABASES has not been set up to notify an operator if it fails.

 - The job DatabaseIntegrityCheck - USER_DATABASES has not been set up to notify an operator if it fails.

 - The job Distribution clean up: distribution has not been set up to notify an operator if it fails.

 - The job Expired subscription clean up has not been set up to notify an operator if it fails.

 - The job IndexOptimize - USER_DATABASES has not been set up to notify an operator if it fails.

 - The job IntegrationAPI_SSIS_Integration has not been set up to notify an operator if it fails.

 - The job Output File Cleanup has not been set up to notify an operator if it fails.

 - The job Replication agents checkup has not been set up to notify an operator if it fails.

 - The job sp_cycle_errorlog has not been set up to notify an operator if it fails.


 - The job sp_delete_backuphistory has not been set up to notify an operator if it fails.

 - The job sp_purge_jobhistory has not been set up to notify an operator if it fails.

 - The job SSIS Server Maintenance Job has not been set up to notify an operator if it fails.

 - The job StatisticsUpdate - USER_DATABASES has not been set up to notify an operator if it fails.

- Alerts Disabled
 - The following Alert is disabled, please review and enable if desired: Peer-to-peer conflict detection alert

 - The following Alert is disabled, please review and enable if desired: Replication: agent custom shutdown

 - The following Alert is disabled, please review and enable if desired: Replication: agent retry

 - The following Alert is disabled, please review and enable if desired: Replication: agent success

 - The following Alert is disabled, please review and enable if desired: Replication: expired subscription dropped

 - The following Alert is disabled, please review and enable if desired: Replication: Subscriber has failed data validation

 - The following Alert is disabled, please review and enable if desired: Replication: Subscriber has passed data validation

 - The following Alert is disabled, please review and enable if desired: Replication: Subscription reinitialized after validation failure

- No Alerts for Corruption  - SQL Server Agent alerts do not exist for errors 823, 824, and 825.  These three errors can give you notification about early hardware failure. Enabling them can prevent you a lot of heartbreak.

- No Alerts for Sev 19-25  - SQL Server Agent alerts do not exist for severity levels 19 through 25.  These are some very severe SQL Server errors. Knowing that these are happening may let you recover from errors faster.

- No failsafe operator configured  - No failsafe operator is configured on this server.  This is a good idea just in-case there are issues with the [msdb] database that prevents alerting.

- Not All Alerts Configured  - Not all SQL Server Agent alerts have been configured.  This is a free, easy way to get notified of corruption, job failures, or major outages even before monitoring systems pick it up.


**Priority 200: Non-Default Server Config**:

- Agent XPs  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- backup checksum default  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- backup compression default  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- clr enabled  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- cost threshold for parallelism  - This sp_configure option has been changed.  Its default value is 5 and it has been set to 50.

- Database Mail XPs  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- max degree of parallelism  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 8.

- max server memory (MB)  - This sp_configure option has been changed.  Its default value is 2147483647 and it has been set to 106496.

- min server memory (MB)  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 73728.

- optimize for ad hoc workloads  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- remote admin connections  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- scan for startup procs  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.

- show advanced options  - This sp_configure option has been changed.  Its default value is 0 and it has been set to 1.


**Priority 200: Performance**:

- User-Created Statistics In Place
    * distribution - [distribution] has 10 user-created statistics. This indicates that someone is being a rocket scientist with the stats, and might actually be slowing things down, especially during stats updates.

    * Database3 - [Database3] has 7 user-created statistics. This indicates that someone is being a rocket scientist with the stats, and might actually be slowing things down, especially during stats updates.


**Priority 210: Non-Default Database Config**:

- ANSI NULL Default Enabled - This database setting is not the default.

    * Database3Integration

    * Database3IntegrationSandBox

- Read Committed Snapshot Isolation Enabled - This database setting is not the default.

    * Database6

    * Database6Sandbox


**Priority 240: Wait Stats**:

- 1 - CXPACKET  - 817.9 hours of waits, 259.3 minutes average wait time per hour, 0.7% signal wait, 45111405 waiting tasks, 65.3 ms average wait time.

- 2 - TRACEWRITE  - 188.8 hours of waits, 59.9 minutes average wait time per hour, 0.0% signal wait, 354542 waiting tasks, 1917.4 ms average wait time.


**Priority 250: Server Info**:

- Default Trace Contents  - The default trace holds 28 hours of data between Jul 31 2017  3:30AM and Aug  1 2017  7:55AM. The default trace files are located in: E:\MSSQL13.MSSQLSERVER\MSSQL\Log

- Drive C Space  - 35530.00MB free on C drive

- Drive E Space  - 129384.00MB free on E drive

- Drive F Space  - 135632.00MB free on F drive

- Drive G Space  - 13597.00MB free on G drive

- Drive H Space  - 102956.00MB free on H drive

- Drive I Space  - 4207.00MB free on I drive

- Drive Q Space  - 9619.00MB free on Q drive

- Drive R Space  - 255214.00MB free on R drive

- Hardware  - Logical processors: 8. Physical memory: 128GB.

- Hardware - NUMA Config  - Node: 0 State: ONLINE Online schedulers: 8 Offline schedulers: 0 Processor Group: 0 Memory node: 0 Memory VAS Reserved GB: 160

- Server Last Restart  - Jul 23 2017  8:07PM

- Server Name  - SecretServerName

- Services
 - Service: SQL Full-text Filter Daemon Launcher (MSSQLSERVER) runs under service account NT Service\MSSQLFDLauncher. Last startup time: not shown.. Startup type: Manual, currently Running.

 - Service: SQL Server (MSSQLSERVER) runs under service account SecretDomain\SQLSERVER. Last startup time: Jul 24 2017 10:39AM. Startup type: Manual, currently Running.

 - Service: SQL Server Agent (MSSQLSERVER) runs under service account SecretDomain\SQLAGENT. Last startup time: not shown.. Startup type: Manual, currently Running.

- SQL Server Last Restart  - Jul 24 2017 10:39AM

- SQL Server Service  - Version: 13.0.4435.0. Patch Level: SP1. Edition: Standard Edition (64-bit). AlwaysOn Enabled: 0. AlwaysOn Mgr Status: 2

- Virtual Server  - Type: (HYPERVISOR)

- Windows Version  - You're running a pretty modern version of Windows: Server 2012R2 era, version 6.3


**Priority 254: Rundate**:

 - Captain's log: stardate something and something...

Here are the first few rows returned by:

SELECT *
FROM sys.dm_exec_query_stats WITH (NOLOCK)
ORDER BY creation_time ASC

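Correct me if I'm wrong, but this suggests that something happened around 15:38:48 and removed the plans, and for some reason it did not touch the plan created at 15:33. From 15:38:48.117 onward, plans start being cached again.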

plan_handle creation_time   last_execution_time
0x0500070027F2492EC09D25603E00000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:33:55.160 2017-08-01 15:59:18.223
0x05000700CD381A65E0AD58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.117 2017-08-01 15:59:50.737
0x05000700CD381A65E0AD58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.120 2017-08-01 15:59:50.737
0x0500070091273D3040B458482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.120 2017-08-01 15:59:56.963
0x05000700CD381A65E0AD58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.127 2017-08-01 15:59:50.737
0x05000700AB256B35A0BC58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.133 2017-08-01 15:59:50.533
0x05000700B31CF10530BE58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.163 2017-08-01 15:59:58.687
0x0500070051C21425C0BE58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.167 2017-08-01 15:59:56.003
0x0500070055138E3B50BF58482900000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.177 2017-08-01 15:59:55.017
0x05000700ED867A0F30FFB4B53200000001000000000000000000000000000000000000000000000000000000  2017-08-01 15:38:48.180 2017-08-0
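Reading flush times off raw rows gets tedious, so here is a sketch that buckets `creation_time` by minute instead. The idea (my assumption, not something stated in the thread) is that a minute with a sudden burst of newly cached statements, preceded by few or no surviving older entries, marks the moment just after a flush.

```sql
-- Number of cached statements per creation minute.
-- A burst right after a gap suggests a flush happened just before it.
SELECT
    DATEADD(MINUTE, DATEDIFF(MINUTE, 0, creation_time), 0) AS created_minute,
    COUNT(*)                                               AS statements_cached
FROM sys.dm_exec_query_stats
GROUP BY DATEADD(MINUTE, DATEDIFF(MINUTE, 0, creation_time), 0)
ORDER BY created_minute;
```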

Bre*_*zar 8

You have a few different questions here:

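Does the "Lock Pages in Memory" setting preserve the plan cache? Only indirectly. LPIM means SQL Server's memory won't be paged out to disk under memory pressure, but SQL Server will still give up memory when the operating system is under pressure. Jonathan Kehayias covers this in detail. The short answer is that when the box is under pressure, given a min memory setting of 16GB, you will still see SQL Server give up memory (and shrink the contents of the plan cache) even with LPIM.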
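To confirm whether LPIM is actually in effect on an instance, the memory model reported by the engine can be queried directly. This is a small sketch I am adding; `sql_memory_model_desc` is available from SQL Server 2016 SP1 onward, which covers the build in the question.

```sql
-- CONVENTIONAL = LPIM not in use; LOCK_PAGES = LPIM in use; LARGE_PAGES = large pages in use.
SELECT sql_memory_model_desc
FROM sys.dm_os_sys_info;

-- Non-zero only when pages are actually locked in memory.
SELECT locked_page_allocations_kb / 1024 AS locked_pages_mb
FROM sys.dm_os_process_memory;
```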

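Does SQL 2016 SP1 CU3 clear the plan cache regularly? No differently than other versions. Your title suggests that you're seeing different behavior after applying the CU. If that's really the case, the best thing to do is uninstall the CU (going back to the previous CU2) and see whether the behavior changes. I'd bet, though, that you're seeing behavior that only started recently.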

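How do I find out what is clearing part of the plan cache? Since you wrote that the plan cache isn't being completely cleared, you need to look at which databases/tables/indexes/statistics are involved in the plans that disappear. For example, you may have a table that hits the 20% change threshold that triggers an automatic statistics update.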
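To see how close each statistic is to an automatic update, the modification counter can be compared with the row count. The query below is a sketch to run inside each suspect database; the `pct_modified` column is my own illustration. Note also that the exact auto-update threshold depends on version and compatibility level, so treat 20% as the answer's rule of thumb rather than a hard limit.

```sql
-- Statistics ordered by most recent update; a last_updated value that lines up
-- with a plan cache drop makes that table a suspect.
SELECT
    OBJECT_SCHEMA_NAME(s.object_id) AS schema_name,
    OBJECT_NAME(s.object_id)        AS table_name,
    s.name                          AS stats_name,
    sp.rows,
    sp.modification_counter,
    sp.last_updated,
    CAST(100.0 * sp.modification_counter / NULLIF(sp.rows, 0) AS decimal(9, 2)) AS pct_modified
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE OBJECTPROPERTY(s.object_id, 'IsUserTable') = 1
ORDER BY sp.last_updated DESC;
```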

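Since this isn't a real emergency, I'll give you the casual, low-effort approach I'd take to troubleshoot it. I'd run sp_BlitzCache periodically (disclaimer: I'm one of the co-authors, although my code is terrible) and look at the top 10 most resource-intensive queries. (You could even log the results to a table with the OutputDatabaseName, OutputSchemaName, and OutputTableName parameters.) Then, when the plan cache clears, run it again and see which plans disappeared. What do they have in common? For instance, do they all hit one major sales table? If so, what does the change rate on that table look like? You can use rowmodctr to get a rough idea of how often tables change.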
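The two steps described above might look like the following sketch. The database, schema, and table names (`DBAtools`, `dbo`, `BlitzCacheResults`) are hypothetical placeholders, and the code assumes sp_BlitzCache is installed where it can be called as shown; the `rowmodctr` query reads the compatibility view `sys.sysindexes` in the current database.

```sql
-- Log the top 10 plans to a table; schedule this to run periodically.
-- DBAtools / dbo / BlitzCacheResults are hypothetical names.
EXEC dbo.sp_BlitzCache
     @Top                = 10,
     @OutputDatabaseName = N'DBAtools',
     @OutputSchemaName   = N'dbo',
     @OutputTableName    = N'BlitzCacheResults';

-- Rough change volume per index or statistic since its statistics were last updated.
SELECT
    OBJECT_SCHEMA_NAME(i.id) AS schema_name,
    OBJECT_NAME(i.id)        AS table_name,
    i.name                   AS index_or_stats_name,
    i.rowmodctr
FROM sys.sysindexes AS i
WHERE OBJECTPROPERTY(i.id, 'IsUserTable') = 1
ORDER BY i.rowmodctr DESC;
```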

Since you're using SQL Sentry, you could also set up a custom monitor based on this:

SELECT COUNT(*)
FROM sys.dm_exec_query_stats WITH (NOLOCK)

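That gives you a line graph for pinpointing when your plan cache drops. If it happens at very specific intervals (such as hourly), then I'd use the technique in this answer to track down what's causing it.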
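Without a monitoring tool, the same count can be sampled into a table from a scheduled job and inspected later. This is only a sketch: the table name `dbo.PlanCacheCountLog` and the idea of sampling from an Agent job every minute or so are assumptions of mine, not part of the answer.

```sql
-- One-time setup: a small log table (hypothetical name).
IF OBJECT_ID(N'dbo.PlanCacheCountLog', N'U') IS NULL
    CREATE TABLE dbo.PlanCacheCountLog (
        sample_time       datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
        cached_statements int          NOT NULL
    );

-- Run this on a schedule to take one sample.
INSERT INTO dbo.PlanCacheCountLog (cached_statements)
SELECT COUNT(*)
FROM sys.dm_exec_query_stats;

-- Review: large negative values in change_since_previous mark the drops.
SELECT
    sample_time,
    cached_statements,
    cached_statements - LAG(cached_statements) OVER (ORDER BY sample_time) AS change_since_previous
FROM dbo.PlanCacheCountLog
ORDER BY sample_time;
```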