Is there a _merge indicator available after a merge?

ℕʘʘ*_*ḆḽḘ 8 r dplyr

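Is there a way to get an indicator variable equivalent to _merge after a merge in dplyr?

Something like the Pandas indicator = True option, which essentially tells you how the merge went (the number of matches from each dataset, and so on).

Here is an example in Pandas: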

import pandas as pd

df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})

match = df1.merge(df2, how = 'left', indicator = True)

Here, after a left join between df1 and df2, you want to know immediately how many rows of df1 found a match in df2 and how many did not:

match
Out[53]: 
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only

I can then tabulate this _merge variable:

match._merge.value_counts()
Out[52]: 
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64
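For comparison, right_only can only be non-zero when the join keeps unmatched rows from the right-hand frame, which a left join never does. The sketch below reuses df1 and df2 from the example above with how = 'outer'. Passing a string to indicator names the resulting column; the name source is an assumption chosen purely for illustration and is not part of the original example.

```python
import pandas as pd

# Same toy frames as in the question
df1 = pd.DataFrame({"key1": ["a", "b", "c"], "v1": [1, 2, 3]})
df2 = pd.DataFrame({"key1": ["a", "b", "d"], "v2": [4, 5, 6]})

# An outer join keeps unmatched rows from both sides, so all three
# indicator categories (both, left_only, right_only) can occur.
# "source" is a hypothetical column name used instead of the default "_merge".
outer = df1.merge(df2, on="key1", how="outer", indicator="source")

# Tabulate how each row of the result was produced
counts = outer["source"].value_counts()
print(counts)
```

Here the key 'd' exists only in df2, so it should show up under right_only, while 'c' remains left_only as in the left join.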

I do not see any such option available for, say, a left join in dplyr:

library(dplyr)

key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)

> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA

Am I missing something here? Thanks!

Ada*_*ier 6

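Stata similarly creates a new variable, _merge, whenever it performs any kind of merge or join. I also find it useful to have an option for quickly diagnosing a merge right after it runs.

For the past few months I have been using some basic functions I wrote that simply wrap the dplyr joins. There may be more efficient ways to do this, but here is an example that wraps full_join. If you set the option .merge = T, you get a variable called .merge that is similar to _merge in Stata or Pandas. (It also prints a diagnostic message with the number of matched and unmatched rows each time you use it.) I know you already have an answer to your question, but if you want a function that you can use repeatedly and that works the same way as full_join in dplyr, here is a start. You obviously need to load dplyr for this to work...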

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                        .merge = FALSE, ...){

# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
    message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
    message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
    stop("Variable .merge already exists; change name before proceeding")
}

# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1

# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix,  ...)

# Calculating merge diagnoses 
matched <- joined %>%
    filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
unmatched_x <- joined %>%
    filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
    NROW()
unmatched_y <- joined %>%
    filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()

# Print merge diagnoses
message(
    unmatched_x, " Rows ONLY from left data frame", "\n",
    unmatched_y, " Rows ONLY from right data frame", "\n",
    matched, " Rows matched"
)

# Create .merge variable if specified
if(.merge){
    joined <- joined %>%
        mutate(.merge = 
                   case_when(
                       !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                       is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                       TRUE ~ "matched"
                       )
               )
}

# Dropping tracker variables and returning data frame
joined <- joined %>%
    select(-.x_tracker, -.y_tracker)
return(joined)
}

For example:

data1 <- data.frame(x = 1:10, y = rnorm(10))
data2 <- data.frame(x = 4:20, z = rnorm(17))
full_join_track(data1, data2, .merge = T)


akr*_*run 3

We can create the "merge" column from an inner_join and an anti_join, and then combine the rows with bind_rows:

d1 <- inner_join(df1, df2, by = c('key1' = 'key2')) %>%
                    mutate(merge = "both")  
bind_rows(d1, anti_join(df1, df2, by = c('key1' = 'key2')) %>% 
             mutate(merge = 'left_only'))