Ian*_*oyd 8 concurrency sql-server-2008-r2 locking
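During periods of high concurrency we are running into queries that return nonsensical results: results that violate the logic of the query being issued. It took a while to reproduce the problem. I have managed to distill the reproducible problem down to a few handfuls of T-SQL.
Note: The part of the live system that has the problem is composed of 5 tables, 4 triggers, 2 stored procedures, and 2 views. For this posted question I have simplified the real system into something much more manageable. Things have been pared down, columns removed, stored procedures inlined, views turned into common table expressions, and column values changed. This is a long way of saying that while what follows reproduces the error, it may be harder to understand. You will have to refrain from wondering why something is structured the way it is. I am trying to figure out why the error condition reproducibly happens in this toy model.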
/*
The idea in this system is that people are able to take days off.
We create a table to hold these *"allocations"*,
and declare sample data that only **1** production operator
is allowed to take time off:
*/
IF OBJECT_ID('Allocations') IS NOT NULL DROP TABLE Allocations
CREATE TABLE [dbo].[Allocations](
JobName varchar(50) PRIMARY KEY NOT NULL,
Available int NOT NULL
)
--Sample allocation; there is 1 available slot for this job
INSERT INTO Allocations(JobName, Available)
VALUES ('Production Operator', 1);
/*
Then we open up the system to the world, and everyone puts in for time.
We store these requests for time off as *"transactions"*.
Two production operators requested time off.
We create sample data, and note that one of the users
created their transaction first (by earlier CreatedDate):
*/
IF OBJECT_ID('Transactions') IS NOT NULL DROP TABLE Transactions;
CREATE TABLE [dbo].[Transactions](
TransactionID int NOT NULL PRIMARY KEY CLUSTERED,
JobName varchar(50) NOT NULL,
ApprovalStatus varchar(50) NOT NULL,
CreatedDate datetime NOT NULL
)
--Two sample transactions
INSERT INTO Transactions (TransactionID, JobName, ApprovalStatus, CreatedDate)
VALUES (52625, 'Production Operator', 'Booked', '20140125 12:00:40.820');
INSERT INTO Transactions (TransactionID, JobName, ApprovalStatus, CreatedDate)
VALUES (60981, 'Production Operator', 'WaitingList', '20150125 12:19:44.717');
/*
The allocation, and two sample transactions are now in the database:
*/
--Show the sample data
SELECT * FROM Allocations
SELECT * FROM Transactions
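The transactions both start as WaitingList. Next, we have a periodic task that runs, looking for empty slots and bumping anyone on the WaitingList into a Booked status.
In a separate SSMS window, we have the simulated recurring stored procedure: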
/*
Simulate recurring task that looks for empty slots,
and bumps someone on the waiting list into that slot.
*/
SET NOCOUNT ON;
--Reset the faulty row so we can continue testing
UPDATE Transactions SET ApprovalStatus = 'WaitingList'
WHERE TransactionID = 60981
--DBCC TRACEON(3604,1200,3916,-1) WITH NO_INFOMSGS
DECLARE @attempts int
SET @attempts = 0;
WHILE (@attempts < 1000000)
BEGIN
SET @attempts = @attempts+1;
/*
The concept is that if someone is already "Booked", then they occupy an available slot.
We compare the configured amount of allocations (e.g. 1) to how many slots are used.
If there are any slots leftover, then find the **earliest** created transaction that
is currently on the WaitingList, and set them to Booked.
*/
--The statement before a WITH (common table expression) must end with a semicolon
PRINT '=== Looking for someone to bump ===';
WITH AvailableAllocations AS (
SELECT
a.JobName,
a.Available AS Allocations,
ISNULL(Booked.BookedCount, 0) AS BookedCount,
a.Available-ISNULL(Booked.BookedCount, 0) AS Available
FROM Allocations a
FULL OUTER JOIN (
SELECT t.JobName, COUNT(*) AS BookedCount
FROM Transactions t
WHERE t.ApprovalStatus IN ('Booked')
GROUP BY t.JobName
) Booked
ON a.JobName = Booked.JobName
WHERE a.Available > 0
)
UPDATE Transactions SET ApprovalStatus = 'Booked'
WHERE TransactionID = (
SELECT TOP 1 t.TransactionID
FROM AvailableAllocations aa
INNER JOIN Transactions t
ON aa.JobName = t.JobName
AND t.ApprovalStatus = 'WaitingList'
WHERE aa.Available > 0
ORDER BY t.CreatedDate
)
IF EXISTS(SELECT * FROM Transactions WHERE TransactionID = 60981 AND ApprovalStatus = 'Booked')
begin
--DBCC TRACEOFF(3604,1200,3916,-1) WITH NO_INFOMSGS
RAISERROR('The later tranasction, that should never be booked, managed to get booked!', 16, 1)
BREAK;
END
END
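Finally, run this in a third SSMS connection window. It simulates a concurrency issue where the earlier transaction goes from occupying a slot to being on the waiting list: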
/*
Toggle the earlier transaction back to "WaitingList".
This means there are two possibilities:
a) the transaction is "Booked", meaning no slots are available.
Therefore nobody should get bumped into "Booked"
b) the transaction is "WaitingList",
meaning 1 slot is open and both transactions are "WaitingList"
The earliest transaction should then get "Booked" into the slot.
There is no time when there is an open slot where the
first transaction shouldn't be the one to get it - he got there first.
*/
SET NOCOUNT ON;
--Reset the faulty row so we can continue testing
UPDATE Transactions SET ApprovalStatus = 'WaitingList'
WHERE TransactionID = 60981
DECLARE @attempts int
SET @attempts = 0;
WHILE (@attempts < 100000)
BEGIN
SET @attempts = @attempts+1
/*Flip the earlier transaction from Booked back to WaitingList
Because it's now on the waiting list -> there is a free slot.
Because there is a free slot -> a transaction can be booked.
Because this is the earlier transaction -> it should always be chosen to be booked
*/
--DBCC TRACEON(3604,1200,3916,-1) WITH NO_INFOMSGS
PRINT '=== Putting the earlier created transaction on the waiting list ==='
UPDATE Transactions
SET ApprovalStatus = 'WaitingList'
WHERE TransactionID = 52625
--DBCC TRACEOFF(3604,1200,3916,-1) WITH NO_INFOMSGS
IF EXISTS(SELECT * FROM Transactions WHERE TransactionID = 60981 AND ApprovalStatus = 'Booked')
begin
RAISERROR('The later tranasction, that should never be booked, managed to get booked!', 16, 1)
BREAK;
END
END
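Conceptually, the bumping procedure keeps looking for any empty slots. If it finds one, it takes the earliest transaction that is on the WaitingList and marks it as Booked.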
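When tested without concurrency, the logic works. We have two transactions:
There is 1 allocation and 0 booked transactions, so we mark the earlier transaction as Booked:
The next time the task runs, 1 slot is already taken, so there is nothing to update.
If we then update the first transaction and put it back onto the WaitingList: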
UPDATE Transactions SET ApprovalStatus='WaitingList'
WHERE TransactionID = 60981
Then we are back where we started:
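Note: You may be wondering why I am putting a transaction back on the waiting list. That is a casualty of the simplified toy model. In the real system, transactions can be PendingApproval, which also occupies a slot. PendingApproval transactions are put on the waiting list once they are approved. It doesn't matter. Don't worry about it.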
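But when I introduce concurrency, by having a second window constantly put the first transaction back on the waiting list after it has been booked, the later transaction manages to get booked:
The toy test scripts catch this and stop iterating: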
Msg 50000, Level 16, State 1, Line 41
The later tranasction, that should never be booked, managed to get booked!
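The question is: why, in this toy model, is this bail-out condition being triggered?
There are two possible states for the first transaction's approval status: Booked or WaitingList.
Whenever a slot is open, the select of the oldest transaction (i.e. ORDER BY CreatedDate) means the first transaction should be the one to get it. I have learned that after an UPDATE has started and data has been modified, it is still possible to read the old values. Under the initial conditions:
    clustered index: Booked
    non-clustered index: Booked
Then I do an update, and although the clustered index leaf node has been modified, any non-clustered indexes still contain the original value and are still available for reading:
    clustered index: Booked, changed to WaitingList
    non-clustered index: Booked
But that does not explain the observed problem. Yes, the transaction is no longer Booked, which means there is now an empty slot. But that change is not committed yet; it is still held exclusively. If the bumping procedure runs, it will either:
    wait on the exclusive lock, or
    read the old value (e.g. Booked): if snapshot isolation is enabled
Either way, the bumping job would not know that there is an empty slot.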
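Which of those two behaviours applies depends on the database's snapshot settings. The following is a minimal sketch, not part of the original scripts, for checking them. It assumes it is run from a connection whose current database is the one holding the toy tables.

```sql
-- Report the snapshot-related settings of the current database.
-- is_read_committed_snapshot_on = 1 means READ COMMITTED statements
-- read row versions instead of waiting on exclusive locks.
SELECT name,
       is_read_committed_snapshot_on,
       snapshot_isolation_state_desc
FROM sys.databases
WHERE name = DB_NAME();

-- Report the options of the current session,
-- including its effective isolation level.
DBCC USEROPTIONS;
```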
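For days we have been struggling to figure out how these nonsensical results can happen.
You may not understand the original system, but there is a set of reproducible toy scripts. They bail out when the invalid case is detected. Why is it being detected? Why is it happening?
How does NASDAQ solve this? How does cavirtex? How does mtgox?
There are three script blocks. Put them into 3 separate SSMS tabs and run them. The 2nd and 3rd scripts will raise an error. Help me figure out why the error appears.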
Pau*_*ite 12
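The default READ COMMITTED transaction isolation level guarantees that your transaction will not read uncommitted data. It does not guarantee that any data you read will remain the same if you read it again (repeatable reads), or that new data will not appear (phantoms).
These same considerations apply to multiple data accesses within the same statement.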
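A non-repeatable read can be seen with two connections against the toy tables from the question. This is only an illustrative sketch: the ten-second delay is an arbitrary window chosen for the demonstration, and the second session's UPDATE has to be run by hand during that window.

```sql
-- Session A: runs at the default READ COMMITTED isolation level.
BEGIN TRANSACTION;

-- First read of the earlier transaction's status.
SELECT ApprovalStatus
FROM Transactions
WHERE TransactionID = 52625;

-- Arbitrary pause. During it, run this from a second connection
-- (session B), which commits immediately in autocommit mode:
--     UPDATE Transactions SET ApprovalStatus = 'WaitingList'
--     WHERE TransactionID = 52625;
WAITFOR DELAY '00:00:10';

-- Second read inside the same transaction. READ COMMITTED does not
-- promise the same value as the first read: if session B changed the
-- row in the meantime, the new committed value is returned here.
SELECT ApprovalStatus
FROM Transactions
WHERE TransactionID = 52625;

COMMIT TRANSACTION;
```

The same thing can happen between two accesses to the table made by a single statement; the explicit transaction and delay here only make the window wide enough to hit by hand.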
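Your UPDATE statement produces a plan that accesses the Transactions table more than once, so it is susceptible to effects caused by non-repeatable reads and phantoms.
There are multiple ways for this plan to produce results you do not expect under READ COMMITTED isolation.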
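The first Transactions table access finds rows that have a status of WaitingList. The second access counts the number of entries (for the same job) that have a status of Booked. The first access may return only the later transaction (the earlier one is Booked at this point). By the time the second (counting) access occurs, the earlier transaction has been changed to WaitingList. The count of Booked rows is therefore zero, a slot appears to be free, and the later row qualifies for the update to Booked status.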
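There are several ways to set the isolation semantics to get the results you are after. One option is to enable READ_COMMITTED_SNAPSHOT for the database. This provides statement-level read consistency for statements that run at the default isolation level. Non-repeatable reads and phantoms are not possible under read committed snapshot isolation.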
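As a rough sketch of that option, the setting is changed with ALTER DATABASE. The database name MyDatabase below is a placeholder, not a name from the question, and the ROLLBACK IMMEDIATE clause is an assumption about what is acceptable on the target system, since it rolls back other sessions' open transactions in that database.

```sql
-- 'MyDatabase' is a hypothetical name; substitute the real database.
-- The change needs the database to be free of other active connections,
-- so WITH ROLLBACK IMMEDIATE is used here to roll back and disconnect
-- them instead of waiting indefinitely. Do not run this casually on a
-- busy production system.
ALTER DATABASE MyDatabase
SET READ_COMMITTED_SNAPSHOT ON
WITH ROLLBACK IMMEDIATE;
```

Once enabled, statements running at READ COMMITTED read from a consistent, versioned snapshot taken as of the start of each statement. Row versions are kept in tempdb, so this option trades some tempdb load for the stronger read consistency.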
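I have to say, though, that I would not have designed the schema or the query this way. There is rather more work involved than should be necessary to meet the stated business requirement. Perhaps this is partly the result of the simplifications in the question; in any case, that is a separate question.
The behaviour you are seeing does not represent a bug of any kind. The scripts produce correct results given the requested isolation semantics. Concurrency effects like this are also not limited to plans that access data multiple times.
The read committed isolation level provides many fewer guarantees than are commonly assumed. For example, skipping rows and/or reading the same row more than once is perfectly possible.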