Yes*_*yyy 4 r jaro-winkler record-linkage
My tibble has only one column, called "title".
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
2 zostavax shingles vaccine
3 xray operator
4 workplace mental health wellbeing workshop
5 zostavax recall toolkit
6 xray meetint
7 workplace mental health and wellbeing
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
I would like to find similar records and group them together (while keeping their indices):
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
2 zostavax shingles vaccine
5 zostavax recall toolkit
3 xray operator
6 xray meetint
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
4 workplace mental health wellbeing workshop
7 workplace mental health and wellbeing
I use the function below to find strings that are close enough to each other (cutoff = 0.75):
compareJW <- function(string1, string2, cutoff)
{
  require(RecordLinkage)
  jarowinkler(string1, string2) > cutoff
}
I have implemented the loop below to "send" similar records together into a new data frame, but it does not work properly. I have tried a few variations, but nothing has worked so far.
# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)
for(i in 1:nrow(dat))
{
  # print(dat$title[i])
  for(j in 1:nrow(dat))
  {
    print(dat$title[i])
    print(dat$title[j])
    # score <- jarowinkler(dat$title[i], dat$title[j])
    if(dat$title[i] != dat$title[j]
       &&
       compareJW(dat$title[i], dat$title[j], 0.75))
    {
      print("if")
      # newDB <- rbind(newDB,
      #                dat$title[i],
      #                dat$title[j])
    }
    else
    {
      print("else")
      # newDB <- rbind(newDB, dat$title[i])
    }
  }
}
(I have inserted print calls in the loop to see what is happening.)
Reproducible data:
dat <-
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine",
"xray operator", "workplace mental health wellbeing workshop",
"zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing",
"lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses",
"xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
Any suggestions, please? EDIT: I would also like a new index column called "group", as shown below:
> dat
# A tibble: 13 x 1
index group title
<chr>
1 1 lymphoedema clinic
8 1 lymphoedema early intervenstion
9 1 lymphoedema expo
10 1 lymphoedema for breast care nurses
2 2 zostavax shingles vaccine
5 2 zostavax recall toolkit
3 3 xray operator
6 3 xray meetint
11 3 xray meeting and case studies
12 3 xray online examination
13 3 xray operator in service paediatric extremities
4 4 workplace mental health wellbeing workshop
7 4 workplace mental health and wellbeing
小智 9
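I'm afraid I have never tried RecordLinkage, but if you are just using the Jaro-Winkler distance, it should also be fairly easy to cluster similar strings with the stringdist package. Using your dput above: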
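library(tidyverse)
library(stringdist)

map_dfr(dat$title, ~ {
  # indices of all titles within the distance cutoff of the current title
  i <- which(stringdist(., dat$title, "jw") < 0.40)
  tibble(index = i, title = dat$title[i])
}, .id = "group") %>%
  # keep only the first group in which each index appears
  distinct(index, .keep_all = TRUE) %>%
  mutate(group = as.integer(group))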
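Explanation:
map_dfr iterates over each string in dat$title and extracts the indices of the closest matches as computed by stringdist (limited by 0.40, i.e. your "threshold"). It builds a tibble from those indices and the matching titles, then stacks the tibbles, adding a group variable that corresponds to the integer position (and row number) of the original string. distinct then drops any cluster duplicates based on repeated values of index.
Output: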
# A tibble: 13 x 3
group index title
<int> <int> <chr>
1 1 1 lymphoedema clinic
2 1 8 lymphoedema early intervenstion
3 1 9 lymphoedema expo
4 1 10 lymphoedema for breast care nurses
5 2 2 zostavax shingles vaccine
6 2 5 zostavax recall toolkit
7 2 11 xray meeting and case studies
8 3 3 xray operator
9 3 6 xray meetint
10 3 12 xray online examination
11 3 13 xray operator in service paediatric extremities
12 4 4 workplace mental health wellbeing workshop
13 4 7 workplace mental health and wellbeing
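To make the explanation more concrete, here is a minimal sketch of what a single iteration of map_dfr does. It uses a hand-picked subset of the titles; the subset and the names titles, d, i and s are illustrative assumptions, not part of the original answer. Keep in mind that stringdist returns a distance (0 means identical), so the 0.40 cutoff is not directly comparable to the 0.75 similarity cutoff in the question. Also, as far as the stringdist documentation describes it, method "jw" applies the Winkler prefix adjustment only when p is greater than 0; with the default p = 0 it returns the plain Jaro distance.

```r
library(stringdist)

# Illustrative subset of the titles from the question (an assumption for brevity)
titles <- c("lymphoedema clinic",
            "zostavax shingles vaccine",
            "xray operator",
            "lymphoedema expo",
            "xray meetint")

# One iteration: distances from the first title to every title, itself included
d <- stringdist(titles[1], titles, method = "jw")

# Indices of the titles that fall under the distance cutoff
i <- which(d < 0.40)

# The matching similarity scores; for method "jw" these are 1 - distance
s <- stringsim(titles[1], titles, method = "jw")

print(data.frame(index = seq_along(titles), title = titles,
                 distance = round(d, 3), similarity = round(s, 3)))
print(i)
```

Because distinct keeps the first occurrence of each index, a title ends up in the earliest group whose seed string it is close enough to. That would explain why index 11 appears in group 2 rather than group 3 in the output above; a tighter cutoff may change this, at the cost of splitting other groups.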
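An interesting alternative would be to use tidytext with widyr to tokenize by word and compute the cosine similarity of the titles based on shared words (rather than characters, as above).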
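A rough sketch of that word-based alternative, assuming the dplyr, tidytext and widyr packages are installed. The small tibble titles_tbl and the column name index are hypothetical and only serve the illustration; this shows how the pairwise similarities could be computed, and turning them into groups (for example with a cutoff, as in the code above) is left open.

```r
library(dplyr)
library(tidytext)
library(widyr)

# Hypothetical subset of the titles from the question
titles_tbl <- tibble(title = c("lymphoedema clinic",
                               "lymphoedema expo",
                               "xray operator",
                               "xray meetint",
                               "xray meeting and case studies"))

# One row per (title index, word) pair, with the word count n
word_counts <- titles_tbl %>%
  mutate(index = row_number()) %>%
  unnest_tokens(word, title) %>%
  count(index, word)

# Cosine similarity between titles, based on their word counts
sims <- word_counts %>%
  pairwise_similarity(index, word, n) %>%
  arrange(desc(similarity))

print(sims)
```

Unlike the character-based distance, this approach treats "meetint" and "meeting" as entirely different words, so typos are not bridged unless the titles share other words.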