pok*_*oke 9 sql-server-2008 database-design sql-server relational-division
Suppose I have a structure like this:
Recipes
    RecipeID
    Name
    Description
RecipeIngredients
    RecipeID
    IngredientID
    Quantity
    UOM
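The key on RecipeIngredients is (RecipeID, IngredientID).
What are some good ways to find duplicate recipes? A duplicate recipe is defined as one having exactly the same set of ingredients, with the same quantity for each ingredient.
I've thought of using FOR XML PATH to combine the ingredients into a single column. I haven't fully explored this, but it should work if I make sure the ingredients/UOMs/quantities are sorted in the same order and use a proper separator. Are there better approaches?
There are 48K recipes and 200K ingredient rows.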
ype*_*eᵀᴹ 10
This is a generalization of the relational division problem. I have no idea how efficient this will be:
; WITH cte AS
  ( SELECT RecipeID_1 = r1.RecipeID, Name_1 = r1.Name,
           RecipeID_2 = r2.RecipeID, Name_2 = r2.Name
    FROM Recipes AS r1
      JOIN Recipes AS r2
        ON r1.RecipeID <> r2.RecipeID
    WHERE NOT EXISTS
          ( SELECT 1
            FROM RecipeIngredients AS ri1
            WHERE ri1.RecipeID = r1.RecipeID
              AND NOT EXISTS
                  ( SELECT 1
                    FROM RecipeIngredients AS ri2
                    WHERE ri2.RecipeID = r2.RecipeID
                      AND ri1.IngredientID = ri2.IngredientID
                      AND ri1.Quantity = ri2.Quantity
                      AND ri1.UOM = ri2.UOM
                  )
          )
  )
SELECT c1.*
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.RecipeID_1 = c2.RecipeID_2
    AND c1.RecipeID_2 = c2.RecipeID_1
    AND c1.RecipeID_1 < c1.RecipeID_2;
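To see why the CTE has to be joined back to itself, a minimal sketch on made-up rows may help. The table variable @RecipeIngredients and its values are hypothetical and only for illustration; the double NOT EXISTS mirrors the query above. On its own it reports every ordered pair where all ingredient rows of the first recipe are also present in the second, which is the relational-division part:

```sql
-- Hypothetical sample data: recipes 1 and 3 are identical,
-- recipe 2 has the same rows plus one extra ingredient.
DECLARE @RecipeIngredients TABLE
( RecipeID     INT NOT NULL,
  IngredientID INT NOT NULL,
  Quantity     INT NOT NULL,
  UOM          INT NOT NULL,
  PRIMARY KEY (RecipeID, IngredientID)
);

INSERT INTO @RecipeIngredients (RecipeID, IngredientID, Quantity, UOM)
VALUES (1, 10, 1, 1), (1, 20, 2, 1),
       (2, 10, 1, 1), (2, 20, 2, 1), (2, 30, 1, 1),
       (3, 10, 1, 1), (3, 20, 2, 1);

-- Ordered pairs (r1, r2) where no row of r1 is missing from r2.
SELECT RecipeID_1 = r1.RecipeID, RecipeID_2 = r2.RecipeID
FROM (SELECT DISTINCT RecipeID FROM @RecipeIngredients) AS r1
  JOIN (SELECT DISTINCT RecipeID FROM @RecipeIngredients) AS r2
    ON r1.RecipeID <> r2.RecipeID
WHERE NOT EXISTS
      ( SELECT 1
        FROM @RecipeIngredients AS ri1
        WHERE ri1.RecipeID = r1.RecipeID
          AND NOT EXISTS
              ( SELECT 1
                FROM @RecipeIngredients AS ri2
                WHERE ri2.RecipeID = r2.RecipeID
                  AND ri1.IngredientID = ri2.IngredientID
                  AND ri1.Quantity = ri2.Quantity
                  AND ri1.UOM = ri2.UOM
              )
      )
ORDER BY RecipeID_1, RecipeID_2;
```

With these rows the result should be the pairs (1,2), (1,3), (3,1) and (3,2). Recipe 1 is contained in recipe 2 but not the other way round, so only the pair that appears in both directions, (1,3) and (3,1), is a true duplicate. The final self-join keeps it once, as (1,3).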
Another (similar) approach:
SELECT RecipeID_1 = r1.RecipeID, Name_1 = r1.Name,
       RecipeID_2 = r2.RecipeID, Name_2 = r2.Name
FROM Recipes AS r1
  JOIN Recipes AS r2
    ON r1.RecipeID < r2.RecipeID
    AND NOT EXISTS
        ( SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID
        EXCEPT
          SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri2
          WHERE ri2.RecipeID = r2.RecipeID
        )
    AND NOT EXISTS
        ( SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri2
          WHERE ri2.RecipeID = r2.RecipeID
        EXCEPT
          SELECT IngredientID, Quantity, UOM
          FROM RecipeIngredients AS ri1
          WHERE ri1.RecipeID = r1.RecipeID
        ) ;
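One difference between the EXCEPT form and the `=` comparisons of the first query only matters if Quantity or UOM can be NULL. The question does not say whether they can, so the nullable columns in this sketch are an assumption, and the table variable and its rows are made up. EXCEPT treats two NULLs as the same value when comparing rows, while `=` does not:

```sql
-- Hypothetical data: two recipes with the same single row, UOM unknown (NULL).
DECLARE @RecipeIngredients TABLE
( RecipeID     INT NOT NULL,
  IngredientID INT NOT NULL,
  Quantity     INT NULL,      -- assumed nullable for this illustration
  UOM          INT NULL,      -- assumed nullable for this illustration
  PRIMARY KEY (RecipeID, IngredientID)
);

INSERT INTO @RecipeIngredients (RecipeID, IngredientID, Quantity, UOM)
VALUES (1, 10, 2, NULL),
       (2, 10, 2, NULL);

-- EXCEPT comparison: returns no rows, so the recipes count as duplicates.
SELECT IngredientID, Quantity, UOM
FROM @RecipeIngredients
WHERE RecipeID = 1
EXCEPT
SELECT IngredientID, Quantity, UOM
FROM @RecipeIngredients
WHERE RecipeID = 2;

-- Equality comparison: ri1.UOM = ri2.UOM is UNKNOWN for NULLs, so the row
-- of recipe 1 is reported as "missing" from recipe 2.
SELECT ri1.IngredientID
FROM @RecipeIngredients AS ri1
WHERE ri1.RecipeID = 1
  AND NOT EXISTS
      ( SELECT 1
        FROM @RecipeIngredients AS ri2
        WHERE ri2.RecipeID = 2
          AND ri1.IngredientID = ri2.IngredientID
          AND ri1.Quantity = ri2.Quantity
          AND ri1.UOM = ri2.UOM
      );
```

The first statement returns an empty result and the second returns IngredientID 10. If the columns are declared NOT NULL, as in the test schema further down, the two approaches agree.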
Another one, different:
; WITH cte AS
  ( SELECT RecipeID_1 = r.RecipeID, RecipeID_2 = ri.RecipeID,
           ri.IngredientID, ri.Quantity, ri.UOM
    FROM Recipes AS r
      CROSS JOIN RecipeIngredients AS ri
  )
, cte2 AS
  ( SELECT RecipeID_1, RecipeID_2,
           IngredientID, Quantity, UOM
    FROM cte
  EXCEPT
    SELECT RecipeID_2, RecipeID_1,
           IngredientID, Quantity, UOM
    FROM cte
  )
  SELECT RecipeID_1 = r1.RecipeID, RecipeID_2 = r2.RecipeID
  FROM Recipes AS r1
    JOIN Recipes AS r2
      ON r1.RecipeID < r2.RecipeID
EXCEPT
  SELECT RecipeID_1, RecipeID_2
  FROM cte2
EXCEPT
  SELECT RecipeID_2, RecipeID_1
  FROM cte2 ;
Tested at SQL-Fiddle.
Using the CHECKSUM() and CHECKSUM_AGG() functions, tested at SQL-Fiddle-2:
(ignore this one, as it may give false positives)
ALTER TABLE RecipeIngredients
  ADD ck AS CHECKSUM( IngredientID, Quantity, UOM )
    PERSISTED ;
CREATE INDEX ckecksum_IX
  ON RecipeIngredients
    ( RecipeID, ck ) ;
; WITH cte AS
  ( SELECT RecipeID,
           cka = CHECKSUM_AGG(ck)
    FROM RecipeIngredients AS ri
    GROUP BY RecipeID
  )
SELECT RecipeID_1 = c1.RecipeID, RecipeID_2 = c2.RecipeID
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.cka = c2.cka
    AND c1.RecipeID < c2.RecipeID ;
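A possible way to keep the checksum without the false positives is to use it only as a cheap pre-filter and confirm each candidate pair with an exact comparison. The sketch below is not part of the tested fiddles; the table variable and its rows are hypothetical, and the confirmation step simply reuses the two-way EXCEPT from the earlier query:

```sql
-- Hypothetical sample data: recipes 1 and 3 are identical, recipe 2 differs
-- in one quantity. ck is a (non-persisted) computed checksum per row.
DECLARE @RecipeIngredients TABLE
( RecipeID     INT NOT NULL,
  IngredientID INT NOT NULL,
  Quantity     INT NOT NULL,
  UOM          INT NOT NULL,
  ck AS CHECKSUM(IngredientID, Quantity, UOM),
  PRIMARY KEY (RecipeID, IngredientID)
);

INSERT INTO @RecipeIngredients (RecipeID, IngredientID, Quantity, UOM)
VALUES (1, 10, 1, 1), (1, 20, 2, 1),
       (2, 10, 1, 1), (2, 20, 3, 1),
       (3, 10, 1, 1), (3, 20, 2, 1);

WITH cte AS
  ( SELECT RecipeID,
           cka = CHECKSUM_AGG(ck)
    FROM @RecipeIngredients
    GROUP BY RecipeID
  )
SELECT RecipeID_1 = c1.RecipeID, RecipeID_2 = c2.RecipeID
FROM cte AS c1
  JOIN cte AS c2
    ON  c1.cka = c2.cka               -- pre-filter: may still include collisions
    AND c1.RecipeID < c2.RecipeID
WHERE NOT EXISTS                      -- exact check: nothing in c1 missing from c2
      ( SELECT IngredientID, Quantity, UOM
        FROM @RecipeIngredients WHERE RecipeID = c1.RecipeID
        EXCEPT
        SELECT IngredientID, Quantity, UOM
        FROM @RecipeIngredients WHERE RecipeID = c2.RecipeID
      )
  AND NOT EXISTS                      -- exact check: nothing in c2 missing from c1
      ( SELECT IngredientID, Quantity, UOM
        FROM @RecipeIngredients WHERE RecipeID = c2.RecipeID
        EXCEPT
        SELECT IngredientID, Quantity, UOM
        FROM @RecipeIngredients WHERE RecipeID = c1.RecipeID
      );
```

On this data the only row returned is the pair (1, 3). Identical recipes always produce identical aggregate checksums, so the pre-filter cannot drop a real duplicate. Recipes with no ingredient rows never appear in the CTE, so they are not compared at all in this variant.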
For the following assumed schema and example data
CREATE TABLE dbo.RecipeIngredients
    (
      RecipeId INT NOT NULL ,
      IngredientID INT NOT NULL ,
      Quantity INT NOT NULL ,
      UOM INT NOT NULL ,
      CONSTRAINT RecipeIngredients_PK
          PRIMARY KEY ( RecipeId, IngredientID ) WITH (IGNORE_DUP_KEY = ON)
    ) ;
INSERT INTO dbo.RecipeIngredients
SELECT TOP (210000) ABS(CRYPT_GEN_RANDOM(8)/50000),
                    ABS(CRYPT_GEN_RANDOM(8) % 100),
                    ABS(CRYPT_GEN_RANDOM(8) % 10),
                    ABS(CRYPT_GEN_RANDOM(8) % 5)
FROM master..spt_values v1,
     master..spt_values v2
SELECT DISTINCT RecipeId, 'X' AS Name
INTO Recipes
FROM dbo.RecipeIngredients
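This populated 205,009 ingredient rows and 42,613 recipes. The numbers will be slightly different each time because of the random element.
It assumes relatively few duplicates (the output after an example run was 217 duplicate recipe groups, with two or three recipes per group). Based on the figures in the OP, the most pathological case would be 48,000 exact duplicates.
A script to set that up is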
DROP TABLE dbo.RecipeIngredients,Recipes
GO
CREATE TABLE Recipes(
    RecipeId INT IDENTITY,
    Name VARCHAR(1))
INSERT INTO Recipes
SELECT TOP 48000 'X'
FROM master..spt_values v1,
     master..spt_values v2
CREATE TABLE dbo.RecipeIngredients
    (
      RecipeId INT NOT NULL ,
      IngredientID INT NOT NULL ,
      Quantity INT NOT NULL ,
      UOM INT NOT NULL ,
      CONSTRAINT RecipeIngredients_PK
          PRIMARY KEY ( RecipeId, IngredientID )) ;
INSERT INTO dbo.RecipeIngredients
SELECT RecipeId,IngredientID,Quantity,UOM
FROM Recipes
     CROSS JOIN (SELECT 1,1,1 UNION ALL SELECT 2,2,2 UNION ALL SELECT 3,3,3 UNION ALL SELECT 4,4,4) I(IngredientID,Quantity,UOM)
The following completed in less than a second on my machine for both cases.
CREATE TABLE #Concat
  (
     RecipeId     INT,
     concatenated VARCHAR(8000),
     PRIMARY KEY (concatenated, RecipeId)
  )
INSERT INTO #Concat
SELECT R.RecipeId,
       ISNULL(concatenated, '')
FROM   Recipes R
       CROSS APPLY (SELECT CAST(IngredientID AS VARCHAR(10)) + ',' + CAST(Quantity AS VARCHAR(10)) + ',' + CAST(UOM AS VARCHAR(10)) + ','
                    FROM   dbo.RecipeIngredients RI
                    WHERE  R.RecipeId = RecipeId
                    ORDER  BY IngredientID
                    FOR XML PATH('')) X (concatenated);
WITH C1
     AS (SELECT DISTINCT concatenated
         FROM   #Concat)
SELECT STUFF(Recipes, 1, 1, '')
FROM   C1
       CROSS APPLY (SELECT ',' + CAST(RecipeId AS VARCHAR(10))
                    FROM   #Concat C2
                    WHERE  C1.concatenated = C2.concatenated
                    ORDER  BY RecipeId
                    FOR XML PATH('')) R(Recipes)
WHERE  Recipes LIKE '%,%,%'
DROP TABLE #Concat
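One caveat
I assumed that the length of the concatenated string won't exceed 896 bytes. If it does, this will raise an error at run time rather than failing silently. You would then need to remove the primary key (and its implicitly created index) from the #temp table. The maximum length of the concatenated string in my test setup was 125 characters.
If the concatenated string is too long to index, the performance of the final XML PATH query that consolidates the identical recipes could well be poor. Installing and using a custom CLR string aggregate would be one solution, as that can do the concatenation in one pass over the data rather than with a non-indexed self join.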
SELECT YourClrAggregate(RecipeId)
FROM #Concat
GROUP BY concatenated
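The question is tagged sql-server-2008, where no built-in string aggregate exists. On SQL Server 2017 or later, the built-in STRING_AGG function could play the role of the CLR aggregate. The following is only a sketch under that version assumption; the #ConcatDemo table and its rows are hypothetical stand-ins for #Concat:

```sql
-- Requires SQL Server 2017 or later (STRING_AGG is not available on 2008).
-- #ConcatDemo is a made-up stand-in for #Concat, deliberately without an index.
CREATE TABLE #ConcatDemo
  (
     RecipeId     INT NOT NULL,
     concatenated VARCHAR(8000) NOT NULL
  );

INSERT INTO #ConcatDemo (RecipeId, concatenated)
VALUES (1, '10,1,1,20,2,1,'),
       (2, '10,1,1,20,3,1,'),
       (3, '10,1,1,20,2,1,');

-- One grouped pass over the data instead of a self join per distinct string.
SELECT STRING_AGG(CAST(RecipeId AS VARCHAR(10)), ',')
         WITHIN GROUP (ORDER BY RecipeId) AS Recipes
FROM   #ConcatDemo
GROUP  BY concatenated
HAVING COUNT(*) > 1;   -- keep only signatures shared by two or more recipes

DROP TABLE #ConcatDemo;
```

For these rows the query returns a single value, `1,3`. With many recipes per group the aggregated list could exceed 8,000 bytes, in which case casting RecipeId to VARCHAR(MAX) inside STRING_AGG avoids a truncation error.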
I also tried
WITH Agg
     AS (SELECT RecipeId,
                MAX(IngredientID)          AS MaxIngredientID,
                MIN(IngredientID)          AS MinIngredientID,
                SUM(IngredientID)          AS SumIngredientID,
                COUNT(IngredientID)        AS CountIngredientID,
                CHECKSUM_AGG(IngredientID) AS ChkIngredientID,
                MAX(Quantity)              AS MaxQuantity,
                MIN(Quantity)              AS MinQuantity,
                SUM(Quantity)              AS SumQuantity,
                COUNT(Quantity)            AS CountQuantity,
                CHECKSUM_AGG(Quantity)     AS ChkQuantity,
                MAX(UOM)                   AS MaxUOM,
                MIN(UOM)                   AS MinUOM,
                SUM(UOM)                   AS SumUOM,
                COUNT(UOM)                 AS CountUOM,
                CHECKSUM_AGG(UOM)          AS ChkUOM
         FROM   dbo.RecipeIngredients
         GROUP  BY RecipeId)
SELECT A1.RecipeId AS RecipeId1,
       A2.RecipeId AS RecipeId2
FROM   Agg A1
       JOIN Agg A2
         ON A1.MaxIngredientID = A2.MaxIngredientID
            AND A1.MinIngredientID = A2.MinIngredientID
            AND A1.SumIngredientID = A2.SumIngredientID
            AND A1.CountIngredientID = A2.CountIngredientID
            AND A1.ChkIngredientID = A2.ChkIngredientID
            AND A1.MaxQuantity = A2.MaxQuantity
            AND A1.MinQuantity = A2.MinQuantity
            AND A1.SumQuantity = A2.SumQuantity
            AND A1.CountQuantity = A2.CountQuantity
            AND A1.ChkQuantity = A2.ChkQuantity
            AND A1.MaxUOM = A2.MaxUOM
            AND A1.MinUOM = A2.MinUOM
            AND A1.SumUOM = A2.SumUOM
            AND A1.CountUOM = A2.CountUOM
            AND A1.ChkUOM = A2.ChkUOM
            AND A1.RecipeId <> A2.RecipeId
WHERE  NOT EXISTS (SELECT *
                   FROM   (SELECT *
                           FROM   RecipeIngredients
                           WHERE  RecipeId = A1.RecipeId) R1
                          FULL OUTER JOIN (SELECT *
                                           FROM   RecipeIngredients
                                           WHERE  RecipeId = A2.RecipeId) R2
                            ON R1.IngredientID = R2.IngredientID
                               AND R1.Quantity = R2.Quantity
                               AND R1.UOM = R2.UOM
                   WHERE  R1.RecipeId IS NULL
                           OR R2.RecipeId IS NULL)
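This works acceptably when there are relatively few duplicates (less than a second for the first example data), but it performs badly in the pathological case: the initial aggregation returns exactly the same results for every RecipeID, so it does not reduce the number of full comparisons at all.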
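To put a number on the pathological case, the count of pairs that survive the aggregate join when all 48,000 recipes are identical can be worked out directly. This is plain arithmetic on the figure from the question, not a measurement:

```sql
-- n recipes that all match each other: n * (n - 1) ordered pairs (the join
-- above uses <>), or half that many unordered pairs.
-- BIGINT is needed because 48000 * 47999 overflows a 32-bit INT.
SELECT OrderedPairs   = CAST(48000 AS BIGINT) * (48000 - 1),
       UnorderedPairs = CAST(48000 AS BIGINT) * (48000 - 1) / 2;
-- OrderedPairs = 2303952000, UnorderedPairs = 1151976000
```

Each of those roughly 2.3 billion ordered pairs then needs its own FULL OUTER JOIN check, which is why grouping on a per-recipe string is so much cheaper in that scenario.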