How to perform a fuzzy join in R using fuzzyjoin::difference_*

bri*_*enb 6 r fuzzy-comparison fuzzyjoin

I am working with two different datasets that I want to merge based on a threshold. Suppose the two data frames look like this:

library(dplyr)
library(fuzzyjoin)
library(lubridate)

df1 = data_frame(Item=1:5, 
                 DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23", 
                            "2015-01-02 15:46:11", "2015-04-19 22:11:33", 
                            "2015-06-10 07:00:00"), 
                 Count=c(1, 6, 11, 15, 9), 
                 Name="Sterling", 
                 Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime = ymd_hms(df1$DateTime)

df2 = data_frame(Item=21:25, 
                 DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:15:23", 
                            "2015-01-02 15:46:11", "2015-05-19 22:11:33", 
                            "2015-06-10 07:00:02"), 
                 Count=c(3, 7, 11, 15, 8), 
                 Name="Sterling", 
                 Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime = ymd_hms(df2$DateTime)

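What I would like to do now is left join df2 to df1 based on a fuzzy match of DateTime and Count, where each value is within two (seconds, for DateTime) of the other, while all other columns except Item are identical. I thought I could do the following: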

df1 %>%
  difference_left_join(df2, by=c("DateTime", "Count"), max_dist=2)

But this gives me the following output:

 # A tibble: 8 × 10
  Item.x          DateTime.x Count.x   Name.x Friend.x Item.y          DateTime.y Count.y   Name.y  Friend.y
   <int>              <dttm>   <dbl>    <chr>    <chr>  <int>              <dttm>   <dbl>    <chr>     <chr>
1      1 2015-01-01 11:12:14       1 Sterling      Pam     21 2015-01-01 11:12:15       3 Sterling       Pam
2      1 2015-01-01 11:12:14       1 Sterling      Pam     21 2015-01-01 11:12:15       3 Sterling       Pam
3      2 2015-01-02 09:15:23       6 Sterling    Cyril     NA                <NA>      NA     <NA>      <NA>
4      3 2015-01-02 15:46:11      11 Sterling   Cheryl     23 2015-01-02 15:46:11      11 Sterling Woodhouse
5      3 2015-01-02 15:46:11      11 Sterling   Cheryl     23 2015-01-02 15:46:11      11 Sterling Woodhouse
6      4 2015-04-19 22:11:33      15 Sterling  Mallory     NA                <NA>      NA     <NA>      <NA>
7      5 2015-06-10 07:00:00       9 Sterling     Lana     25 2015-06-10 07:00:02       8 Sterling      Lana
8      5 2015-06-10 07:00:00       9 Sterling     Lana     25 2015-06-10 07:00:02       8 Sterling      Lana

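This is close, except that the third row should not have merged, given that the names differ (and I would have expected the second row to merge given the thresholds, even though I don't want it to).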

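How can I get the following data frame? Note that the second and third rows from df2 are not merged, even though their DateTime and Count satisfy the threshold. That is because their other columns (apart from Item) are not identical.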

desired_output
#   Item            DateTime Count     Name  Friend
# 1    3 2015-01-02 15:46:11    11 Sterling  Cheryl
# 2    2 2015-01-02 09:15:23     6 Sterling   Cyril
# 3    5 2015-06-10 07:00:00     9 Sterling    Lana
# 4   25 2015-06-10 07:00:02     8 Sterling    Lana
# 5    4 2015-04-19 22:11:33    15 Sterling Mallory
# 6    1 2015-01-01 11:12:14     1 Sterling     Pam
# 7   21 2015-01-01 11:12:15     3 Sterling     Pam

Hac*_*k-R 5

OK, so the message you are getting is because a fuzzy match cannot be computed on a non-numeric column.

The thing to do is convert it to numeric. Since your threshold is in seconds, I converted the date-times to seconds and then made them numeric:

library(dplyr)
library(fuzzyjoin)
library(lubridate)

df1 = data_frame(Item=1:5, 
                 DateTime=c("2015-01-01 11:12:14", "2015-01-02 09:15:23", 
                            "2015-01-02 15:46:11", "2015-04-19 22:11:33", 
                            "2015-06-10 07:00:00"), 
                 Count=c(1, 6, 11, 15, 9), 
                 Name="Sterling", 
                 Friend=c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))
df1$DateTime1 = as.numeric(seconds(ymd_hms(df1$DateTime)))

df2 = data_frame(Item=21:25, 
                 DateTime=c("2015-01-01 11:12:15", "2015-01-02 19:25:56", 
                            "2015-01-02 15:46:11", "2015-05-19 22:11:33", 
                            "2015-06-10 07:00:02"), 
                 Count=c(3, 6, 11, 15, 8), 
                 Name="Sterling", 
                 Friend=c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))
df2$DateTime1 = as.numeric(seconds(ymd_hms(df2$DateTime)))

df1 %>%
  difference_left_join(y=df2, by=c("DateTime1", "Count"), max_dist=2)

Per our discussion in the comments, a simple tweak is to subset to the cases where the other character columns match:

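# Note: this compares Friend row by row, so it assumes df1 and df2
# have the same number of rows in the same order.
df1[df2$Friend == df1$Friend,] %>%
  difference_left_join(y=df2[df2$Friend == df1$Friend,], by=c("DateTime1", "Count"), max_dist=2)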

This example only uses Friend, but you can of course use & to handle multiple columns.
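If the two data frames are not guaranteed to line up row by row, one alternative is to let the join itself check the character columns. The sketch below is not from the original answer: it assumes fuzzyjoin's `fuzzy_left_join()` with one match function per column in `by`, and `within_two` is a hypothetical helper name introduced here. It uses the question's data, requires DateTime1 and Count to be within 2 and Name and Friend to be equal, then stacks the matched df2 rows under df1 to get the shape asked for in the question.

```r
library(dplyr)
library(fuzzyjoin)
library(lubridate)

# Data from the question
df1 <- tibble(Item = 1:5,
              DateTime = ymd_hms(c("2015-01-01 11:12:14", "2015-01-02 09:15:23",
                                   "2015-01-02 15:46:11", "2015-04-19 22:11:33",
                                   "2015-06-10 07:00:00")),
              Count = c(1, 6, 11, 15, 9),
              Name = "Sterling",
              Friend = c("Pam", "Cyril", "Cheryl", "Mallory", "Lana"))

df2 <- tibble(Item = 21:25,
              DateTime = ymd_hms(c("2015-01-01 11:12:15", "2015-01-02 19:15:23",
                                   "2015-01-02 15:46:11", "2015-05-19 22:11:33",
                                   "2015-06-10 07:00:02")),
              Count = c(3, 7, 11, 15, 8),
              Name = "Sterling",
              Friend = c("Pam", "Kreger", "Woodhouse", "Gillete", "Lana"))

# Seconds since the epoch, so a threshold of 2 means two seconds
df1$DateTime1 <- as.numeric(df1$DateTime)
df2$DateTime1 <- as.numeric(df2$DateTime)

# Hypothetical helper: TRUE where two numeric values differ by at most 2
within_two <- function(a, b) abs(a - b) <= 2

# One match function per column in `by`, in the same order:
# fuzzy on DateTime1 and Count, exact on Name and Friend
joined <- fuzzy_left_join(
  df1, df2,
  by = c("DateTime1", "Count", "Name", "Friend"),
  match_fun = list(within_two, within_two, `==`, `==`)
)

# Stack the df2 rows that found a partner under df1, as in the desired output
matched_items <- joined$Item.y[!is.na(joined$Item.y)]
result <- bind_rows(df1, filter(df2, Item %in% matched_items)) %>%
  select(-DateTime1) %>%
  arrange(Friend, Item)

result
```

Because the equality checks are part of the join condition, the row for Cheryl/Woodhouse stays unmatched even though its DateTime and Count agree exactly.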