SQL Server: linking records after string matching

Dat*_*ice 5 sql sql-server similarity

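Background - I have a set of customer data and have used a string-matching algorithm to compare all records for similarity. I now need to group the results that are related to each other, either directly or through a chain of associations, and apply a unique ID to each group.

Problem - I can't work out a way to link the records together and apply a unique ID to each group.

Example

The matches found so far look like this (MatchScore is not relevant to the problem here, but it shows where the data comes from).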

+-------------+-------------+------------+
| CustomerID1 | CustomerID2 | MatchScore |
+-------------+-------------+------------+
|     2021000 |     2707799 | 0.075      |
|     2021000 |     3856308 | 0.082      |
|      774062 |      774063 | 0.041      |
|      998328 |     2278386 | 0.063      |
|      998328 |      998329 | 0.058      |
|      998329 |     2278386 | 0.030      |
+-------------+-------------+------------+

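The bottom 3 records are all linked, so I want them to share the same association ID.

[Image: diagram showing that these records are all related]

This is what I would like the data to look like: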

+----+-------------+-------------+------------+
| ID | CustomerID1 | CustomerID2 | MatchScore |
+----+-------------+-------------+------------+
|  1 |      998328 |     2278386 | 0.063      |
|  1 |      998328 |      998329 | 0.058      |
|  1 |      998329 |     2278386 | 0.030      |
|  2 |     2021000 |     2707799 | 0.075      |
|  2 |     2021000 |     3856308 | 0.082      |
|  3 |      774062 |      774063 | 0.041      |
+----+-------------+-------------+------------+

Or, similarly:

+----+------------+
| ID | CustomerID |
+----+------------+
|  1 |    2278386 |
|  1 |     998328 |
|  1 |     998329 |
|  2 |    2021000 |
|  2 |    2707799 |
|  2 |    3856308 |
|  3 |     774062 |
|  3 |     774063 |
+----+------------+

Code to generate the example table:

select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
into #tmp
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore

select * from #tmp

As I said, I can't see how to link the records together. I have tried various joins, but the eureka moment never came. Please can you help?

Thanks

Vik*_*888 2

I'm not sure whether this is the result you expect:

with tmp as(
select '998328' as CustomerID1,'998329' as CustomerID2,'0.058' as MatchScore
union
select '998328' as CustomerID1,'2278386' as CustomerID2,'0.063' as MatchScore
union
select '998329' as CustomerID1,'2278386' as CustomerID2,'0.030' as MatchScore
union
select '2021000' as CustomerID1,'2707799' as CustomerID2,'0.075' as MatchScore
union
select '2021000' as CustomerID1,'3856308' as CustomerID2,'0.082' as MatchScore
union
select '774062' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore
union
-- the next two rows are not in the question's sample data: a reversed pair and a self-pair
select '774063' as CustomerID1,'774062' as CustomerID2,'0.041' as MatchScore
union
select '774063' as CustomerID1,'774063' as CustomerID2,'0.041' as MatchScore)


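-- Rows that share a rank_value end up with the same id.
select dense_rank() over (order by rank_value) as id, t1.CustomerID1, t1.CustomerID2
from (
    select
        t1.*,
        -- Use the CustomerID1 of a linked row (t2) when one exists;
        -- otherwise fall back to the reversed pair (t3).
        case
            when t2.CustomerID1 is not null
                then t2.CustomerID1
            else t3.CustomerID1
        end as rank_value
    from tmp t1
    -- t2: another row that contains t1's CustomerID1 (not the same pair reversed)
    left join tmp t2
        on (t1.CustomerID1 = t2.CustomerID2
            and t1.CustomerID2 != t2.CustomerID1
            and t1.CustomerID1 != t1.CustomerID2
            and t2.CustomerID1 != t2.CustomerID2)
        or (t1.CustomerID1 = t2.CustomerID1
            and t1.CustomerID2 != t2.CustomerID2
            and t1.CustomerID1 != t1.CustomerID2)
    -- t3: the same pair stored in the opposite order
    left join tmp t3
        on t1.CustomerID1 = t3.CustomerID2
        and t1.CustomerID2 = t3.CustomerID1
) t1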

I get the following result:

[Image: screenshot of the query result]

Note: the DENSE_RANK() function is available from SQL Server 2005 onward.
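
For records that are linked only through a chain of other records, one possible alternative is to treat each match as an edge in a graph and repeatedly push the smallest group key across the edges until nothing changes. The following is a minimal sketch of that idea, not the approach used in the query above. It assumes the #tmp table from the question already exists; the names #edges, #groups, GroupKey and @changed are made up for illustration.

```sql
-- Assumption: #tmp (CustomerID1, CustomerID2, MatchScore) exists as in the question,
-- and #edges / #groups do not exist yet in this session.

-- Store every match in both directions so links can be followed either way.
select CustomerID1 as a, CustomerID2 as b
into #edges
from #tmp
union
select CustomerID2, CustomerID1
from #tmp;

-- Every customer starts in a group of its own, keyed by its own ID.
select distinct a as CustomerID, a as GroupKey
into #groups
from #edges;

-- Repeatedly give each customer the smallest GroupKey held by any neighbour.
-- The loop stops once a pass changes no rows.
declare @changed int = 1;

while @changed > 0
begin
    update g
    set GroupKey = m.MinKey
    from #groups g
    join (
        select e.a as CustomerID, min(n.GroupKey) as MinKey
        from #edges e
        join #groups n on n.CustomerID = e.b
        group by e.a
    ) m on m.CustomerID = g.CustomerID
    where m.MinKey < g.GroupKey;

    set @changed = @@rowcount;
end;

-- Turn the group keys into consecutive IDs.
select dense_rank() over (order by GroupKey) as ID, CustomerID
from #groups
order by ID, CustomerID;
```

On the sample data this should put the eight customers into three groups, matching the second desired layout in the question, although which group receives which ID number depends on how the keys sort (the sample IDs are strings, so they compare as text). The number of passes grows with the length of the longest chain, so on a large match table it may be worth indexing #edges and #groups on the join columns.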