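I have two data frames (x and y) in which the IDs are student_name, father_name and mother_name. Because of typos ("n" instead of "m", random white space, etc.), about 60% of the values do not line up, even though I can eyeball the data and see that they should. Is there a way to reduce the level of mismatch somehow, so that manual editing becomes at least feasible? The data frames have about 700K observations.
R would be best. I know a little Python and some basic Unix tools. P.S. I have read about agrep(), but I don't understand how it can work on an actual data set, especially when the match is over more than one variable.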
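Update (data for the posted bounty):
Here are two example data frames, sites_a and sites_b. They can be matched on the numeric columns lat and lon as well as on the sitename column. It would be useful to know how this could be done on a) just lat + lon, b) sitename, or c) both.
You can get the file test_sites.R, which is posted as a gist.
Ideally, the answer would end with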
merge(sites_a, sites_b, by = **magic**)
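The agrep function (part of base R), which does approximate string matching using the generalized Levenshtein edit distance, is probably worth trying. Without knowing what your data looks like, I can't really suggest a working solution. But here is a suggestion: it records the matches in a separate list (if there are several equally good matches, those are recorded as well). Suppose your data.frame is called df: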
l <- vector('list', nrow(df))
matches <- list(mother = l, father = l)
for (i in seq_len(nrow(df))) {
  ## first look for an exact match
  father_id <- with(df, which(student_name[i] == father_name))
  if (length(father_id) == 1) {
    matches[['father']][[i]] <- father_id
  } else {
    old_father_id <- NULL
    ## tighten the allowed distance step by step until the match is unique
    for (m in 10:1) { ## m is the maximum distance
      father_id <- with(df, agrep(student_name[i], father_name, max.distance = m))
      if (length(father_id) == 1 || m == 1) {
        ## if we find a unique match or if we are in our last round, then stop
        matches[['father']][[i]] <- father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) > 0) {
        ## if we can't do better than multiple matches, then record them anyway
        matches[['father']][[i]] <- old_father_id
        break
      } else if (length(father_id) == 0 && length(old_father_id) == 0) {
        ## if the nearest match is more than 10 different from the current pattern, then stop
        break
      }
      ## several matches at this distance: remember them before tightening further
      old_father_id <- father_id
    }
  }
}
The code for mother_name is basically the same. You could even put both in one loop, but this example is only meant as an illustration.
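To make the behaviour of agrep more concrete, here is a minimal sketch on toy data. The names are made up for illustration and are not taken from the question's data set; the clean helper is a hypothetical name. It shows that an exact comparison misses a one-letter typo, that a maximum distance of 1 recovers it, and that collapsing stray white space before matching removes one source of mismatches.

```r
# Toy data: hypothetical names, not taken from the question's data set
father_name <- c("Ramesh Kumar", "Suresh  Kumar ", "Anil Sharma")
pattern <- "Ranesh Kumar"  # "n" typed instead of "m"

# Exact comparison finds nothing
exact_hits <- which(father_name == pattern)

# Allowing a single edit recovers the intended row
fuzzy_hits <- agrep(pattern, father_name, max.distance = 1)

# Hypothetical helper: trim the ends and collapse runs of white space
clean <- function(x) gsub("\\s+", " ", trimws(x))
cleaned <- clean(father_name)

print(exact_hits)
print(fuzzy_hits)
print(cleaned)
```

Cleaning both data frames this way before any fuzzy matching will probably shrink the set of rows that need agrep at all, which matters with 700K observations.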
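This takes a list of common column names, matches on agrep applied to all of those columns combined, and then, if all.x or all.y equals TRUE, appends the non-matching records, filling the missing columns with NA. Unlike merge, the column names to match on need to be the same in each data frame. The challenge seems to be setting the agrep options correctly to avoid spurious matches.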
agrepMerge <- function(df1, df2, by, all.x = FALSE, all.y = FALSE,
    ignore.case = FALSE, value = FALSE, max.distance = 0.1, useBytes = FALSE) {
  ## the matches are used as row indices below, so agrep must return positions
  if (value) stop("value = TRUE is not supported: row indices are required")
  ## build one key per row by pasting the 'by' columns together
  df1$index <- apply(df1[, by, drop = FALSE], 1, paste, sep = "", collapse = "")
  df2$index <- apply(df2[, by, drop = FALSE], 1, paste, sep = "", collapse = "")
  ## for every key in df1, find the rows of df2 whose key matches approximately
  matches <- lapply(seq_along(df1$index), function(i) {
    agrep(df1$index[i], df2$index, ignore.case = ignore.case, value = value,
          max.distance = max.distance, useBytes = useBytes)
  })
  ## repeat each df1 row once per match so both sides line up
  df1_match <- rep(seq_len(nrow(df1)), sapply(matches, length))
  df2_match <- unlist(matches)
  df1_hits <- df1[df1_match, ]
  df2_hits <- df2[df2_match, ]
  df1_miss <- df1[setdiff(seq_along(df1$index), df1_match), ]
  df2_miss <- df2[setdiff(seq_along(df2$index), df2_match), ]
  ## keep only the df2 columns that df1 does not already have
  remove_cols <- colnames(df2_hits) %in% colnames(df1_hits)
  df_out <- cbind(df1_hits, df2_hits[, !remove_cols, drop = FALSE])
  if (all.x) {
    missing_cols <- setdiff(colnames(df_out), colnames(df1_miss))
    df1_miss[missing_cols] <- NA
    df_out <- rbind(df_out, df1_miss)
  }
  if (all.y) {
    missing_cols <- setdiff(colnames(df_out), colnames(df2_miss))
    df2_miss[missing_cols] <- NA
    df_out <- rbind(df_out, df2_miss)
  }
  ## drop the helper key before returning
  df_out[, setdiff(colnames(df_out), "index")]
}
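For the sites_a / sites_b case in the question (match on lat + lon, on sitename, or on both), one possible alternative is to compute all pairwise distances and keep the closest candidate. The sketch below is only an illustration under stated assumptions: the two data frames are hypothetical stand-ins, since the gist data is not reproduced here; best_match, name_max and coord_tol are made-up names and thresholds that would need tuning; and adist (from the utils package shipped with R) is swapped in for agrep because it returns the edit distance itself. Building full distance matrices will not scale to 700K rows without blocking the data into smaller groups first.

```r
# Hypothetical stand-ins for sites_a and sites_b (the gist data is not reproduced here)
sites_a <- data.frame(
  sitename = c("Oak Ridge", "Harvard Forest", "Konza Prairie"),
  lat = c(35.96, 42.54, 39.09),
  lon = c(-84.29, -72.17, -96.58),
  stringsAsFactors = FALSE
)
sites_b <- data.frame(
  sitename = c("harvard forest ", "Oak Ridge TN", "Niwot Ridge"),
  lat = c(42.53, 35.97, 40.05),
  lon = c(-72.18, -84.28, -105.59),
  value_b = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# For each row of a, return the row number of the best candidate in b,
# or NA when no candidate passes the thresholds.
# name_max (edits) and coord_tol (degrees) are assumed values, not recommendations.
best_match <- function(a, b, use = c("both", "name", "coords"),
                       name_max = 4, coord_tol = 0.05) {
  use <- match.arg(use)
  norm <- function(x) tolower(gsub("\\s+", " ", trimws(x)))
  # Edit distance between every pair of normalized names
  name_d <- adist(norm(a$sitename), norm(b$sitename))
  # Largest absolute coordinate difference between every pair of sites
  coord_d <- pmax(abs(outer(a$lat, b$lat, "-")), abs(outer(a$lon, b$lon, "-")))
  ok <- switch(use,
    name = name_d <= name_max,
    coords = coord_d <= coord_tol,
    both = name_d <= name_max & coord_d <= coord_tol
  )
  # Rank candidates by coordinates when names are ignored, otherwise by name
  score <- if (use == "coords") coord_d else name_d
  score[!ok] <- Inf
  idx <- apply(score, 1, which.min)
  idx[!apply(ok, 1, any)] <- NA
  idx
}

idx <- best_match(sites_a, sites_b, use = "both")

# Turn the fuzzy match into an ordinary key so that merge() can do the rest
sites_a$b_row <- idx
sites_b$b_row <- seq_len(nrow(sites_b))
merged <- merge(sites_a, sites_b, by = "b_row", suffixes = c("_a", "_b"))
print(merged)
```

Passing use = "name" or use = "coords" gives variants b) and a) from the question. Nothing here stops two rows of sites_a from picking the same row of sites_b, so the result is worth checking for duplicated b_row values before trusting it.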