Dat*_*ice 5 sql sql-server similarity
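Background - I have a set of customer data and use a string-matching algorithm to compare all records for similarity. I then need to group the results that are linked to each other, either directly or through an association, and apply a unique ID to each group.
Problem - I can't figure out a way to link the records together and apply a unique ID to each group.
Example
The matches found so far look like this (MatchScore is not relevant to the question, but it shows where the data comes from).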
+-------------+-------------+------------+
| CustomerID1 | CustomerID2 | MatchScore |
+-------------+-------------+------------+
| 2021000 | 2707799 | 0.075 |
| 2021000 | 3856308 | 0.082 |
| 774062 | 774063 | 0.041 |
| 998328 | 2278386 | 0.063 |
| 998328 | 998329 | 0.058 |
| 998329 | 2278386 | 0.030 |
+-------------+-------------+------------+
The bottom 3 records are all linked, so I would like them to share the same association ID.
This is what I would like the data to look like
+----+-------------+-------------+------------+
| ID | CustomerID1 | CustomerID2 | MatchScore |
+----+-------------+-------------+------------+
| 1 | 998328 | 2278386 | 0.063 |
| 1 | 998328 | 998329 | 0.058 |
| 1 | 998329 | 2278386 | 0.030 |
| 2 | 2021000 | 2707799 | 0.075 |
| 2 | 2021000 | 3856308 | 0.082 |
| 3 | 774062 | 774063 | 0.041 |
+----+-------------+-------------+------------+
Or similarly
+----+------------+
| ID | CustomerID |
+----+------------+
| 1 | 2278386 |
| 1 | 998328 |
| 1 | 998329 |
| 2 | 2021000 |
| 2 | 2707799 |
| 2 | 3856308 |
| 3 | 774062 |
| 3 | 774063 |
+----+------------+
Code to generate the sample table
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
into #tmp
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
select * from #tmp
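As I said, I can't picture how to link the records together. I have tried various joins, but the eureka moment never came. Please could you help.
Thanks
I'm not sure whether this is the result you are expecting: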
with tmp as(
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774062' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore)
select DENSE_RANK() OVER(ORDER BY rank_value) id, t1.CustomerID1, t1.CustomerID2
from(
select
t1.*,
case
when t2.CustomerID1 IS NOT NULL
THEN t2.CustomerID1
ELSE t3.CustomerID1
end rank_value
from tmp t1
left join tmp t2
on (t1.CustomerID1 = t2.CustomerID2
and t1.CustomerID2!=t2.CustomerID1
and (t1.CustomerID1 != t1.CustomerID2 and t2.CustomerID1 != t2.CustomerID2))
or (t1.CustomerID1 = t2.CustomerID1
and t1.CustomerID2 != t2.CustomerID2
and (t1.CustomerID1 != t1.CustomerID2))
left join tmp t3
on t1.CustomerID1 = t3.CustomerID2
and t1.CustomerID2=t3.CustomerID1
)t1
I get the following result
Note: DENSE_RANK() is available in SQL Server 2005 and later.
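Grouping records that are linked directly or through other records amounts to finding the connected components of a graph whose edges are the matched pairs. The following is only a sketch of one possible approach (iterative label propagation), not the method used in the answer above. It assumes the #tmp table from the question already exists; the #nodes table, the GroupKey and Neighbour columns, and the @changed variable are hypothetical names introduced for illustration. It also assumes every CustomerID can be cast to int.

```sql
-- One row per distinct customer; each customer starts in its own group,
-- labelled with its own ID (hypothetical working table #nodes).
select CustomerID, cast(CustomerID as int) as GroupKey
into #nodes
from (
    select CustomerID1 as CustomerID from #tmp
    union
    select CustomerID2 from #tmp
) n;

-- Repeatedly pull the smallest label across each matched pair
-- until a full pass changes nothing.
declare @changed int = 1;
while @changed > 0
begin
    update n
    set GroupKey = m.MinKey
    from #nodes n
    join (
        -- Treat every pair as an undirected edge, then find the
        -- smallest label among each customer's direct neighbours.
        select e.CustomerID, min(nb.GroupKey) as MinKey
        from (
            select CustomerID1 as CustomerID, CustomerID2 as Neighbour from #tmp
            union all
            select CustomerID2, CustomerID1 from #tmp
        ) e
        join #nodes nb on nb.CustomerID = e.Neighbour
        group by e.CustomerID
    ) m on m.CustomerID = n.CustomerID
    where m.MinKey < n.GroupKey;

    set @changed = @@rowcount;
end;

-- Turn the surviving labels into consecutive group IDs.
select dense_rank() over (order by GroupKey) as ID, CustomerID
from #nodes
order by ID, CustomerID;
```

When the loop stops, every customer in a linked group carries the smallest CustomerID of that group, so the group numbers follow that ordering and may differ from the numbering in the question while the group membership is the same. Because both customers in a matched pair end up with the same label, joining #nodes back to #tmp on CustomerID1 should give the pair-level layout. The number of passes grows with the longest chain of links, so this may be slow on very large, chain-like data.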