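I often run into the problem of coalescing the non-NA values of duplicated columns after a join and then dropping the duplicates. It is similar to what is described in this question or this one. I would like to write a small function around coalesce (and possibly including left_join) so that I can handle this in one line whenever it comes up (the function itself can of course be as long as it needs to be).
In doing so, I ran into the lack of a quo_names equivalent of quos, as described here.
For a reprex, take a data frame with identifying information and join it with other data frames that contain the correct values but often-misspelled IDs.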
library(dplyr)
library(rlang)

iris_identifiers <- iris %>%
  select(contains("Petal"), Species)

iris_alt_name1 <- iris %>%
  mutate(Species = recode(Species, "setosa" = "stosa"))

iris_alt_name2 <- iris %>%
  mutate(Species = recode(Species, "versicolor" = "verscolor"))
This simpler function works:
replace_xy <- function(df, var) {
  x_var <- paste0(var, ".x")
  y_var <- paste0(var, ".y")
  df %>%
    mutate(!! quo_name(var) := coalesce(!! sym(x_var), !! sym(y_var))) %>%
    select(-(!! sym(x_var)), -(!! sym(y_var)))
}

iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_xy("Sepal.Length") %>%
  replace_xy("Sepal.Width")
head(iris_full)
#> Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#> 1 1.4 0.2 setosa 5.1 3.5
#> 2 1.4 0.2 setosa 4.9 3.0
#> 3 1.4 0.2 setosa 5.0 3.6
#> 4 1.4 0.2 setosa 4.4 2.9
#> 5 1.4 0.2 setosa 5.2 3.4
#> 6 1.4 0.2 setosa 5.5 4.2
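But I am a bit lost on how to generalize this to several variables, which I thought would be the easier part. The snippet below is just one desperate attempt, after trying a number of variations, that roughly captures what I am trying to achieve.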
replace_many_xy <- function(df, vars) {
  x_var <- paste0(vars, ".x")
  y_var <- paste0(vars, ".y")
  df %>%
    mutate_at(vars(vars), funs(replace_xy(.data, .))) %>%
    select(-(!!! syms(x_var)), -(!!! syms(y_var)))
}

new_cols <- colnames(iris_alt_name1)
diff_cols <- new_cols[!(new_cols %in% colnames(iris_identifiers))]

iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_many_xy(diff_cols)
#> Warning: Column `Species` joining factors with different levels, coercing
#> to character vector
#> Warning: Column `Species` joining character vector and factor, coercing
#> into character vector
#> Error: Unknown columns `Sepal.Length` and `Sepal.Width`
Any help would be greatly appreciated!
We can use {powerjoin}:
library(powerjoin)

iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  power_left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width"), conflict = coalesce_xy) %>%
  head()

iris_full
# Petal.Length Petal.Width Species Sepal.Length Sepal.Width
# 1 1.4 0.2 setosa 5.1 3.5
# 2 1.4 0.2 setosa 4.9 3.0
# 3 1.4 0.2 setosa 5.0 3.6
# 4 1.4 0.2 setosa 4.4 2.9
# 5 1.4 0.2 setosa 5.2 3.4
# 6 1.4 0.2 setosa 5.5 4.2
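power_left_join is an improved left_join that, among other things, lets you handle column conflicts through the conflict argument, as we did here.
The conflict argument is a function that is applied to each pair of conflicting columns in turn. To coalesce from the right-hand side instead, use conflict = coalesce_yx.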
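To make the difference between the two directions concrete, here is a minimal sketch using plain dplyr::coalesce on two made-up vectors. It assumes that coalesce_xy behaves like coalesce(x, y) and coalesce_yx like coalesce(y, x). The names x, y, xy and yx are purely illustrative and are not part of powerjoin.

```r
library(dplyr)

# Hypothetical pair of conflicting columns: x from the left table, y from the right.
x <- c(1, NA, 3, NA)
y <- c(10, 20, NA, NA)

# Left-first: keep x where it is not NA, otherwise fall back to y.
xy <- coalesce(x, y)

# Right-first: keep y where it is not NA, otherwise fall back to x.
yx <- coalesce(y, x)

print(xy)
print(yx)
```

The two results differ only where both sides are non-NA (the first element here), and a row stays NA only when both sides are missing.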
Here is one way to make your function work:
replace_many_xy <- function(tbl, vars) {
  for (var in vars) {
    x <- paste0(var, ".x")
    y <- paste0(var, ".y")
    tbl <- mutate(tbl, !!sym(var) := coalesce(!!sym(x), !!sym(y))) %>%
      select(-one_of(x, y))
  }
  tbl
}

iris_full <- iris_identifiers %>%
  left_join(iris_alt_name1, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  left_join(iris_alt_name2, by = c("Species", "Petal.Length", "Petal.Width")) %>%
  replace_many_xy(diff_cols) %>%
  as_tibble()
# # A tibble: 372 x 5
# Petal.Length Petal.Width Species Sepal.Length Sepal.Width
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 1.4 0.2 setosa 5.1 3.5
# 2 1.4 0.2 setosa 4.9 3
# 3 1.4 0.2 setosa 5 3.6
# 4 1.4 0.2 setosa 4.4 2.9
# 5 1.4 0.2 setosa 5.2 3.4
# 6 1.4 0.2 setosa 5.5 4.2
# 7 1.4 0.2 setosa 4.6 3.2
# 8 1.4 0.2 setosa 5 3.3
# 9 1.4 0.2 setosa 5.1 3.5
# 10 1.4 0.2 setosa 4.9 3
# # ... with 362 more rows
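Because the iris output makes it hard to see the NA handling, a small check on hand-built data may help. This is only a sketch: the toy data frame and its column names (id, a, b) are hypothetical, the helper is repeated so the snippet runs on its own, and all_of is my substitution for one_of, on the assumption that a reasonably recent dplyr (1.0.0 or later) is installed.

```r
library(dplyr)
library(rlang)

# Loop-based helper, repeated here so the sketch is self-contained.
replace_many_xy <- function(tbl, vars) {
  for (var in vars) {
    x <- paste0(var, ".x")
    y <- paste0(var, ".y")
    tbl <- tbl %>%
      # Create the merged column, preferring the .x value when it is not NA.
      mutate(!!sym(var) := coalesce(!!sym(x), !!sym(y))) %>%
      # Drop the two suffixed source columns.
      select(-all_of(c(x, y)))
  }
  tbl
}

# Hypothetical data shaped like the result of two joins with clashing columns.
toy <- data.frame(
  id  = 1:3,
  a.x = c(1, NA, 3),
  a.y = c(NA, 2, 30),
  b.x = c("u", NA, NA),
  b.y = c(NA, "v", NA),
  stringsAsFactors = FALSE
)

result <- replace_many_xy(toy, c("a", "b"))
print(result)
```

In row 3 both a.x and a.y are present, so the .x value wins. In row 3 of b both sides are missing, so the merged value stays NA.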