Azure Kubernetes .NET Core 应用程序到 Azure SQL 数据库间歇性错误 258

And*_*sch 5 c# sql-server timeout azure kubernetes

我们正在 Kubernetes 集群中运行 .NET Core 3.1 应用程序。该应用程序使用 EF Core 3.1.7 和 Microsoft.Data.SqlClient 1.1.3 连接到 Azure SQL 数据库。

在看似随机的时间,我们会收到以下错误。

 ---> System.Data.SqlClient.SqlException (0x80131904): Timeout expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at System.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at System.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at System.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at System.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at System.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at System.Data.SqlClient.SqlDataReader.get_MetaData()
   at System.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString)
   at System.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, SqlDataReader ds)
   at System.Data.SqlClient.SqlCommand.ExecuteScalar()
Run Code Online (Sandbox Code Playgroud)

尽管它看起来是随机的,但在较重的负载下肯定会更频繁地发生。从我的研究来看,似乎这个特定的超时与连接超时而不是命令超时有关。即客户端根本无法建立连接。这不是超时的查询。

我们已消除的潜在根本原因:

  • Azure SQL Server 容量:无论我们在 4 个还是 16 个 vCPU 上运行,都会观察到该行为。Azure 支持还确认日志中没有问题。这包括打开的连接数,只有大约 50 个。我们还从其他连接运行了负载测试,服务器保持良好。
  • Microsoft.Data.SqlClient 版本:我们一直在 1.1.3 版上运行,并且这种行为仅在一周前 (2021-03-16) 开始。
  • 网络容量:在这个阶段,我们的最大速度在 1-2MB/s 左右,这是非常普通的。
  • Kubernetes Scaling:事件的发生与我们何时扩展更多 Pod 之间没有关联。
  • 连接字符串问题:我们的系统过去运行良好,但无论如何我们更改了其他文章中提到的一些设置,以查看问题是否不会自行解决。火星被禁用。我们不能禁用连接池。我们已经TrusServerCertificate设置为true。这是当前的连接字符串:Server=tcp:***.database.windows.net,1433;Initial Catalog=***;Persist Security Info=False;User ID=***;Password=***;MultipleActiveResultSets=False;Encrypt=True;Connection Timeout=60;TrustServerCertificate=True;

更新 1: 根据要求,刚刚发生的两次超时的示例。今天是星期天,所以流量非常低。数据库利用率(CPU、Mem、IO)介于 2-6% 之间。

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniError(TdsParserStateObject stateObj, UInt32 error)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ReadSniSyncOverAsync()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadNetworkPacket()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryPrepareBuffer()
   at Microsoft.Data.SqlClient.TdsParserStateObject.TryReadByte(Byte& value)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
   at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
   at Microsoft.Data.SqlClient.SqlInternalConnection.BeginSqlTransaction(IsolationLevel iso, String transactionName, Boolean shouldReconnect)
   at Microsoft.Data.SqlClient.SqlConnection.BeginTransaction(IsolationLevel iso, String transactionName)
   at Microsoft.Data.SqlClient.SqlConnection.BeginDbTransaction(IsolationLevel isolationLevel)
   at Microsoft.EntityFrameworkCore.Storage.RelationalConnection.BeginTransaction(IsolationLevel isolationLevel)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerExecutionStrategy.Execute[TState,TResult](TState state, Func`3 operation, Func`3 verifySucceeded)
Run Code Online (Sandbox Code Playgroud)

使用此命令时,我们的数据库运行状况检查器也收到错误消息: Microsoft.EntityFrameworkCore.Infrastructure.DatabaseFacade.CanConnect()

上面的堆栈跟踪是我们试图解决的问题,而不是 SQL 查询超时下面的这个堆栈跟踪。

Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired.  The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
   at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData()
   at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData()
   at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted)
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean isAsync, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest)
   at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry, String method)
   at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior)
Run Code Online (Sandbox Code Playgroud)

And*_*sch 6

问题是Azure的基础设施问题。

Azure 网络中存在一个已知问题,即只要在某些 VM 队列上发生磁盘附加/分离,dhcp 租约就会丢失。目前正在向地区推出修复程序。我将查看何时为此发布 Azure 状态更新。

问题消失了,因此似乎修复程序已在全球范围内推出。

对于将来遇到此问题的任何其他人,您可以通过在节点(而不是 pod)中建立SSH 连接来识别它。执行ls -al /var/log/并识别所有syslog文件并在每个文件上运行以下 grep。

cat /var/log/syslog | grep 'carrier'
Run Code Online (Sandbox Code Playgroud)

如果日志中有任何Lost carrierGained carrier消息,则存在某种网络问题。在我们的例子中,它是 DHCP 租用。

在此处输入图片说明