mda*_*man 5 sql oracle fuzzy-search match abbreviation
I have dirty data from two different sources, and I am looking for best practices for matching them. Here are some examples of the data:
Source1.Name                   Source2.Name
Adda Clevenger Jr Prep School  Adda Clevenger Junior Preparatory School
Alice Fong Yu Alt School       Alice Fong Yu Alternative School
Convent Of Sacred Heart Es     Convent of Sacred Heart Elementary School
Rosa Parks Elementary School   Rosa Parks Elementary School
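A human can see that these 4 examples should all match under an ideal fuzzy match. I have excellent software for traditional fuzzy matching, which catches misspellings and other small variations. But in this data set I have about a dozen rules that govern abbreviations, such as 'Preparatory' -> 'Prep'. I would like to capture all of those rules in a query. (I will then handle the more traditional fuzziness separately.)
Is there a well-known SQL pattern for handling this requirement? The answer could be as simple as the magic keyword that will unlock examples in my searches. It is a kind of "translation table" or "abbreviation table", but I made those terms up myself. I have not yet found a widely accepted term.
Conceptually, my goal is to start with this naive query: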
/* This succeeds for 1 record and fails for 3 in the sample data set above. */
SELECT * FROM ...
WHERE Source1.Name = Source2.Name
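and then modify it into something that returns all of the desired matches shown above. I expect I could brute-force it with some nested REPLACE functions: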
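/* Intended to match the 4 samples given. 'Junior' -> 'Jr' is needed for the
   first sample, and 'Of' vs 'of' still needs UPPER or LOWER. */
SELECT * FROM ...
WHERE
  REPLACE( REPLACE( REPLACE( REPLACE( Source1.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' ), 'Junior', 'Jr' )
= REPLACE( REPLACE( REPLACE( REPLACE( Source2.Name, 'Preparatory', 'Prep' ), 'Alternative', 'Alt' ), 'Elementary School', 'Es' ), 'Junior', 'Jr' )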
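This is not elegant, and it gets uglier as I account for inconsistent abbreviations (e.g. 'International' is sometimes 'Intl' and sometimes 'Int''l'). Overlapping abbreviations do not work especially well either (e.g. 'Elementary School' -> 'Es', but in other cases 'School' -> 'Sch').
How have others solved this problem?
Note: I am using Oracle. I would probably use REGEXP_REPLACE rather than REPLACE, and I would certainly use UPPER (or LOWER) to avoid case issues. But those details are not the core of the question.
If you have a known set of translations, you can create a function that captures them. You can then add a virtual column to each table that returns the function's result, and compare the virtual columns, which simplifies the query: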
create or replace function abbr_replace ( str varchar2 )
  return varchar2 deterministic as
begin
  -- Lower-case the input, then apply each abbreviation rule in turn.
  return replace(
    replace(
      replace(
        replace( lower( str ), 'preparatory', 'prep' ),
        'junior', 'jr' ),
      'elementary school', 'es' ),
    'alternative', 'alt' );
end abbr_replace;
/

-- Each table stores the raw name plus a virtual column holding the normalized name.
create table source1 (
  name         varchar2(100),
  replace_name varchar2(100) as (
    cast ( abbr_replace ( name ) as varchar2(100) )
  )
);

create table source2 (
  name         varchar2(100),
  replace_name varchar2(100) as (
    cast ( abbr_replace ( name ) as varchar2(100) )
  )
);

insert into source1 (name) values ('Adda Clevenger Jr Prep School');
insert into source1 (name) values ('Alice Fong Yu Alt School');
insert into source1 (name) values ('Convent Of Sacred Heart Es');
insert into source1 (name) values ('Rosa Parks Elementary School');

insert into source2 (name) values ('Adda Clevenger Junior Preparatory School');
insert into source2 (name) values ('Alice Fong Yu Alternative School');
insert into source2 (name) values ('Convent of Sacred Heart Elementary School');
insert into source2 (name) values ('Rosa Parks Elementary School');
commit;

-- Match on the normalized names.
select s1.name, s2.name
from   source1 s1
join   source2 s2
on     s2.replace_name = s1.replace_name;

NAME                                               NAME
-------------------------------------------------- --------------------------------------------------
Adda Clevenger Jr Prep School                      Adda Clevenger Junior Preparatory School
Alice Fong Yu Alt School                           Alice Fong Yu Alternative School
Convent Of Sacred Heart Es                         Convent of Sacred Heart Elementary School
Rosa Parks Elementary School                       Rosa Parks Elementary School
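A point to note:
- The function must be declared deterministic, because Oracle requires that of any user-defined function used in a virtual column expression.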
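The nested replace calls can also be driven from a lookup table, which is roughly the "abbreviation table" idea raised in the question. The following is only a minimal sketch under stated assumptions: the table abbreviations, its columns long_form and short_form, and the function abbr_replace_tab are hypothetical names that do not appear in the answer above. It reuses the source1 and source2 tables.

```sql
-- Hypothetical lookup table: one row per abbreviation rule, stored in lower case.
create table abbreviations (
  long_form  varchar2(50) primary key,
  short_form varchar2(20) not null
);

insert into abbreviations (long_form, short_form) values ('preparatory', 'prep');
insert into abbreviations (long_form, short_form) values ('junior', 'jr');
insert into abbreviations (long_form, short_form) values ('elementary school', 'es');
insert into abbreviations (long_form, short_form) values ('alternative', 'alt');
-- Two spellings can map to the same canonical short form.
insert into abbreviations (long_form, short_form) values ('international', 'intl');
insert into abbreviations (long_form, short_form) values ('int''l', 'intl');
commit;

-- Hypothetical function: lower-cases the input and applies every rule,
-- longest long_form first, so that a multi-word rule such as
-- 'elementary school' is applied before any shorter rule that overlaps it.
create or replace function abbr_replace_tab ( str varchar2 )
  return varchar2 as
  l_name varchar2(4000) := lower( str );
begin
  for r in (
    select long_form, short_form
    from   abbreviations
    order  by length( long_form ) desc, long_form
  ) loop
    l_name := replace( l_name, r.long_form, r.short_form );
  end loop;
  return l_name;
end abbr_replace_tab;
/

-- Match on the normalized names, computed at query time.
select s1.name, s2.name
from   source1 s1
join   source2 s2
on     abbr_replace_tab ( s2.name ) = abbr_replace_tab ( s1.name );
```

Because this function reads a table, its result changes whenever the rules change, so it should not be declared deterministic and is called in the join rather than through a virtual column. Plain replace also matches inside longer words; whole-word matching would need something like the REGEXP_REPLACE mentioned in the question.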
If you are looking for more general fuzzy matching, Oracle already implements the Levenshtein distance and Jaro-Winkler matching algorithms. These are in utl_match:
select s1.name, s2.name, utl_match.jaro_winkler(s1.name, s2.name) jw
from source1 s1
join source2 s2
on utl_match.jaro_winkler(s1.name, s2.name) > .9;
NAME                                               NAME                                               JW
-------------------------------------------------- -------------------------------------------------- -----
Adda Clevenger Jr Prep School                      Adda Clevenger Junior Preparatory School           0.904
Alice Fong Yu Alt School                           Alice Fong Yu Alternative School                   0.925
Convent Of Sacred Heart Es                         Convent of Sacred Heart Elementary School          0.902
Rosa Parks Elementary School                       Rosa Parks Elementary School                       1.000
The script is also available on LiveSQL.
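The query above only exercises Jaro-Winkler. For the Levenshtein side, utl_match also offers edit_distance (the raw number of edits) and edit_distance_similarity (a normalized score from 0 to 100). The sketch below reuses the source1 and source2 tables; the threshold of 60 is an arbitrary assumption for illustration and would need tuning against real data.

```sql
-- Levenshtein-based comparison of every source1 name with every source2 name.
-- edit_distance            : number of single-character edits between the two strings
-- edit_distance_similarity : the same measure normalized to an integer from 0 to 100
select s1.name,
       s2.name,
       utl_match.edit_distance( s1.name, s2.name )            ed,
       utl_match.edit_distance_similarity( s1.name, s2.name ) ed_sim
from   source1 s1
join   source2 s2
on     utl_match.edit_distance_similarity( s1.name, s2.name ) > 60  -- assumed cutoff
order  by ed_sim desc;
```

Heavily abbreviated names sit many edits away from their long forms, so this kind of score is likely to work better on names that have already been normalized with the abbreviation rules than on the raw values.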