How to fix: "After restarting the MS SQL Server used by an Alpine .NET Core 2.2.5 container, TCP connections stay in CLOSE_WAIT and dotnet CPU usage climbs"

Szy*_*ski 8 c# sql docker .net-core alpine-linux

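I have a problem with an application hosted in the official ASP.NET Core Alpine container that uses a SQL Server on another server. After the server running SQL Server is restarted, the container's CPU usage climbs and some internal threads hang. Only restarting the container helps. We found that when this happens, some TCP connections are in the CLOSE_WAIT state. Some information about the application and the servers: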

Application details:

  • .NET Core 2.2 (C#)
  • Based on the official Alpine container (mcr.net.core.asp:2.2.5-alpine3.9)
  • Uses Dapper as well as ADO.NET
  • Uses async methods with async/await
  • Uses Quartz.NET for job scheduling

Docker details:

  • Hosted on CentOS 7
  • Uses Docker 19.03

SQL Server details:

  • MS SQL 2014 Standard x64
  • Windows Server 2012 R2
  • Hosted on a virtual machine

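Detailed problem description:

The container with the application runs 24/7 on a Linux server (CentOS 7) with Docker installed. On the same server there is a virtual machine running Windows Server and MS SQL Server 2014. The application copes with network problems and similar issues, but after the SQL Server machine is rebooted I get this error: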

[19-08-09 04:15:44.15 ERR -- SSI.Pojazd 0216NPIK] Job SSI.Pojazd.retry_Sms_RetryJob`1 threw an unhandled Exception: 
System.Data.SqlClient.SqlException (0x80131904): SHUTDOWN is in progress.
Login failed for user 'XXXX'.
Cannot continue the execution because the session is in the kill state.
A severe error occurred on the current command.  The results, if any, should be discarded.
   at System.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at System.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at System.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at Transport.Core.Abstractions.Database.TransportStoredProcedureWithResult`2.Execute(TInput input)
   at Transport.Jobs.RetriesJobs.RetryJob`1.Execute(IJobExecutionContext context)
   at Quartz.Core.JobRunShell.Run(CancellationToken cancellationToken)
ClientConnectionId:1d7f1d2d-f80b-42cf-906f-f0ee57b14f59
Error Number:6005,State:1,Class:14

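After this error, things start to go wrong:

  • The container's CPU rises from 0.9 CPU to 100 CPU or even more
  • CPU consumed by the dotnet processes (as seen on the Linux host) rises from 1-2 cores to 12-14 cores (depending on how many applications are running)
  • The HealthCheck sometimes hangs: the server does not answer a simple curl request
  • There are TCP connections to SQL Server in the CLOSE_WAIT state
  • If the application uses Quartz.NET jobs, they stop firing (concurrency is disabled) and hang on the last invocation

TCP connection log: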

tcp        0      0 173.25.0.2:44920        10.6.67.122:5672        ESTABLISHED
tcp        0      0 173.25.0.2:47488        10.12.128.12:1433       ESTABLISHED
tcp        0      0 173.25.0.2:47246        10.12.128.12:1433       ESTABLISHED
tcp        0      0 173.25.0.2:46785        10.6.67.122:5672        ESTABLISHED
tcp        0      0 173.25.0.2:45556        10.12.128.12:1433       CLOSE_WAIT
tcp        0      0 173.25.0.2:45520        10.12.128.12:1433       CLOSE_WAIT 
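To track how many connections to SQL Server are stuck, the listing above can be summarized per state. This is only a minimal sketch: it assumes the lines have the six-column layout shown above (protocol, two queue counters, local address, remote address, state), and `NetstatSketch` and `CountStates` are hypothetical names, not part of the application.

```csharp
using System;
using System.Collections.Generic;

public static class NetstatSketch
{
    // Counts connections per TCP state for one remote port.
    // Expected columns: proto recv-q send-q local-address remote-address state
    public static Dictionary<string, int> CountStates(IEnumerable<string> lines, int remotePort)
    {
        var counts = new Dictionary<string, int>();
        foreach (var line in lines)
        {
            var parts = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
            if (parts.Length < 6 || !parts[0].StartsWith("tcp", StringComparison.Ordinal))
                continue; // skip headers and non-TCP rows

            // The remote address is "host:port"; take the part after the last colon.
            var remote = parts[4];
            var colon = remote.LastIndexOf(':');
            if (colon < 0) continue;
            if (!int.TryParse(remote.Substring(colon + 1), out var port) || port != remotePort)
                continue;

            counts.TryGetValue(parts[5], out var seen); // seen stays 0 for a new state
            counts[parts[5]] = seen + 1;
        }
        return counts;
    }
}
```

Called with the six lines above and port 1433, this reports two ESTABLISHED and two CLOSE_WAIT connections; the two connections to port 5672 are ignored.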

What I have tried:

  • Using a CancellationToken as a way to cancel the DB operation in SqlClient.dll

      [Dapper code]
      public async Task Execute(TInput input, IScope scope,CancellationToken token)
      {
          await scope.GetConnection().ExecuteAsync(new CommandDefinition(GetStoredProcedureName(), GetParameters(input), scope.GetTransaction(),
          commandType: CommandType.StoredProcedure,cancellationToken:token)).ConfigureAwait(false);
      }
    
      [ADO.NET CODE]
      public async Task CheckAsync(string connectionString, int timeout, CancellationToken cancellationToken)
      {
          try
          {
              SqlConnectionStringBuilder connectionStringBuilder =
                  new SqlConnectionStringBuilder(connectionString) {ConnectTimeout = 2};
    
              using (var conn = new SqlConnection(connectionStringBuilder.ToString()))
              {
                  try
                  {
                      cancellationToken.ThrowIfCancellationRequested();
                      await conn.OpenAsync(cancellationToken).ConfigureAwait(false);
                      using (var cmd = conn.CreateCommand())
                      {
                          cmd.CommandText = "Select 1";
                          cmd.CommandTimeout = timeout;
                          cancellationToken.ThrowIfCancellationRequested();
                          await cmd.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
                      }
                  }
                  catch (TaskCanceledException tce)
                  {
                      TransportLogger.Log.Debug("Task cancelled " + tce.Message + tce.StackTrace + " " + (tce.InnerException == null ? string.Empty : tce.InnerException.Message));
                      throw;
                  }
                  finally
                  {
                      if (conn.State == ConnectionState.Open) conn.Close();
                  }
    
    
              }
          }
          catch (Exception ex)
          {
              TransportLogger.Log.Error(ex, "Cannot check db assebility");
              throw;
          }
      }
    
  • Clearing the pool with the ClearPool and ClearAllPools methods

      public class CheckDatabaseIsAccessible : ICheckDatabaseIsAccessible
      {
          public async Task CheckAsync(string connectionString, int timeout, CancellationToken cancellationToken)
          {
              try
              {
                  SqlConnectionStringBuilder connectionStringBuilder =
                      new SqlConnectionStringBuilder(connectionString) {ConnectTimeout = 2};
    
                  using (var conn = new SqlConnection(connectionStringBuilder.ToString()))
                  {
                      try
                      {
                          cancellationToken.ThrowIfCancellationRequested();
                          await conn.OpenAsync(cancellationToken).ConfigureAwait(false);
                          using (var cmd = conn.CreateCommand())
                          {
                              cmd.CommandText = "Select 1";
                              cmd.CommandTimeout = timeout;
                              cancellationToken.ThrowIfCancellationRequested();
                              await cmd.ExecuteNonQueryAsync(cancellationToken).ConfigureAwait(false);
                          }
                      }
                      catch (TaskCanceledException tce)
                      {
                          TransportLogger.Log.Debug("Task cancelled " + tce.Message + tce.StackTrace + " " + (tce.InnerException == null ? string.Empty : tce.InnerException.Message));
                          throw;
                      }
                      finally
                      {
                          if (conn.State == ConnectionState.Open) conn.Close();
                      }
    
    
                  }
              }
              catch (Exception ex)
              {
                  TransportLogger.Log.Error(ex, "Cannot check db assebility");
                  throw;
              }
          }
    
          public void Check(string connectionString, int timeout)
          {
              try
              {
                  SqlConnectionStringBuilder connectionStringBuilder =
                      new SqlConnectionStringBuilder(connectionString) {ConnectTimeout = 2};
    
                  using (var conn = new SqlConnection(connectionStringBuilder.ToString()))
                  {
                      try
                      {
                          TransportLogger.Log.Debug("Open connection");
                          conn.Open();
                          TransportLogger.Log.Debug("connection open");
                          using (var cmd = conn.CreateCommand())
                          {
                              cmd.CommandText = "Select 1";
                              cmd.CommandTimeout = timeout;
                              TransportLogger.Log.Debug("Command executing...");
                              cmd.ExecuteNonQuery();
                              TransportLogger.Log.Debug("Command executed");
                          }
                      }
                      catch (TaskCanceledException tce)
                      {
                          TransportLogger.Log.Debug("Task cancelled " + tce.Message + tce.StackTrace + " " + (tce.InnerException == null ? string.Empty : tce.InnerException.Message));
                          throw;
                      }
                      finally
                      {
                          if (conn.State == ConnectionState.Open) conn.Close();
                      }
    
    
                  }
              }
              catch (Exception ex)
              {
                  TransportLogger.Log.Error(ex, "Cannot check db assebility");
                  throw;
              }
          }
      }
    
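For the CancellationToken attempt in the first item, the token handed to `Execute` and `CheckAsync` can be bounded by a timeout roughly as follows. This is a minimal sketch, not the application's real code: `TimeoutSketch` and `TryRunAsync` are hypothetical names, and it only helps if the operation actually observes the token, which is the open question when a SqlClient call is stuck on a half-closed socket.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutSketch
{
    // Runs `operation` with a token that is cancelled when either the outer
    // token fires or `timeout` elapses. Returns false if it was cancelled.
    public static async Task<bool> TryRunAsync(
        Func<CancellationToken, Task> operation,
        TimeSpan timeout,
        CancellationToken outer)
    {
        using (var cts = CancellationTokenSource.CreateLinkedTokenSource(outer))
        {
            cts.CancelAfter(timeout);
            try
            {
                await operation(cts.Token).ConfigureAwait(false);
                return true;
            }
            catch (OperationCanceledException) when (cts.IsCancellationRequested)
            {
                // TaskCanceledException derives from OperationCanceledException,
                // so both are handled here.
                return false;
            }
        }
    }
}
```

A call such as `await TimeoutSketch.TryRunAsync(ct => checker.CheckAsync(connectionString, 5, ct), TimeSpan.FromSeconds(10), CancellationToken.None)` would wrap the check above, where `checker` stands for an instance of `CheckDatabaseIsAccessible` and the timeout values are placeholders.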

The pool was cleared, but nothing changed and the problem persists:

  [19-08-09 04:16:37.16 DBG -- SSI.Pojazd 0216NPIK] Task cancelled A task was canceled.   at Transport.Core.Abstractions.Database.CheckDatabaseIsAccessible.CheckAsync(String connectionString, Int32 timeout, CancellationToken cancellationToken) 
  [19-08-09 04:16:37.16 INF -- SSI.Pojazd 0216NPIK] Try to clear pool 
  [19-08-09 04:16:37.16 INF -- SSI.Pojazd 0216NPIK] Pool cleared 

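I tried to reproduce this in a test environment but failed. We have about 400 servers and this happens frequently. Has anyone run into this problem and found a solution?

I have also read: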