如何使用R精确匹配整个数据集中的两列值

Sop*_*son 6 r ggplot2 dataframe dplyr tidyverse

我在 R 中有下面提到的两个数据框,我尝试了各种方法但还没有达到所需的输出。

DF:

ID     Date                 city        code    uid
I-1    2020-01-01 10:12:15  New York     123    K-1
I-1    2020-01-01 10:12:15  Utha         103    K-1
I-2    2020-01-02 10:12:15  Washington   122    K-1
I-3    2020-02-01 10:12:15  Tokyo        123    K-2
I-3    2020-02-01 10:12:15  Osaka        193    K-2
I-4    2020-02-02 10:12:15  London       144    K-3
I-5    2020-02-04 10:12:15  Dubai        101    K-4
I-6    2019-11-01 10:12:15  Dubai        101    K-4
I-7    2019-11-01 10:12:15  London       144    K-3
I-8    2018-12-13 10:12:15  Tokyo        143    K-5
I-9    2019-05-17 10:12:15  Dubai        101    K-4
I-19   2020-03-11 10:12:15  Dubai        150    K-7
Run Code Online (Sandbox Code Playgroud)

输入:

structure(list(ID = c("I-1", "I-1", 
"I-2", "I-3", "I-3", "I-4", 
"I-5", "I-6", "I-7", "I-8", "I-9","I-19" 
), DATE = c("2020-01-01 11:49:40.842", "2020-01-01 09:35:33.607", 
"2020-01-02 06:14:58.731", "2020-02-01 16:51:27.190", "2020-02-01 05:35:46.952", 
"2020-02-02 05:48:49.443", "2020-02-04 10:00:41.616", "2019-11-01 09:10:46.536", 
"2019-11-01 11:54:05.655", "2018-12-13 14:24:31.617", "2019-05-17 14:24:31.617", "2020-03-11 14:24:31.617"), CITY = c("New York", 
"UTAH", "Washington", "Tokyo", 
"Osaka", "London", "Dubai", 
"Dubai", "London", "Tokyo", "Dubai", 
"Dubai"), CODE = c("221010", 
"411017", "638007", "583101", "560029", "643102", "363001", "452001", 
"560024", "509208"), UID = c("K-1", 
"K-1", "K-1", "K-2", "K-2", 
"K-3", "K-4", "K-4", "K-3", 
"K-5","K-4","K-7")), .Names = c("ID", "DATE", 
"CITY", "CODE", "UID"), row.names = c(NA, 
10L), class = "data.fram)
Run Code Online (Sandbox Code Playgroud)

使用上述两个数据框,我想获取 2020 年 1 月 1 日至 2002 年 2 月 29 日之间的记录,并在整个数据库中比较这些 ID 以检查城市和代码是否与其他 ID 匹配并进一步分类以检查有多少相同的 uid 和多少有不同。

在哪里,

  • 匹配 - 城市和代码的组合与数据库中的其他 ID 匹配
  • Same_uid - 匹配 id 的分类,以确定有多少 ID 具有相似的 uid
  • different_uid - 匹配 id 的分类,以确定有多少 ID 没有相似的 uid
  • uid_count - 整个数据库中该特定 ID 的类似 uid 的计数

注意 - 我在数据框中有超过 10M 的记录。

所需输出

ID      Date                  city         code   uid   Match   Same_uid   different_uid  uid_count
I-1     2020-01-01 10:12:15   New York     123    K-1    No      0          0              2
I-2     2020-01-02 10:12:15   Washington   122    K-1    No      0          0              2
I-3     2020-02-01 10:12:15   Tokyo        123    K-2    No      0          0              1   
I-4     2020-02-02 10:12:15   London       144    K-3    Yes     1          0              2
I-5     2020-02-04 10:12:15   Dubai        101    K-4    Yes     2          0              3 
Run Code Online (Sandbox Code Playgroud)

小智 1

一种方法,

加载到数据集中

    library(tidyverse)
    library(lubridate)

   

   mydata <- tibble(
   ID = c("I-1","I-1",
          "I-2","I-3",
          "I-3","I-4",
          "I-5","I-6",
          "I-7","I-8",
          "I-9","I-19"),
   Date = c("2020-01-01", "2020-01-01",
            "2020-01-02", "2020-02-01",
            "2020-02-01", "2020-02-02",
            "2020-02-04", "2019-11-01", 
            "2019-11-01", "2018-12-13", 
            "2019-05-17", "2020-03-11"),
   city = c("New York", "Utha", 
            "Washington", "Tokyo", 
            "Osaka", "London", 
            "Dubai", "Dubai", 
            "London", "Tokyo", 
            "Dubai", "Dubai"),
   code = c("123", "103", "122", "123", "193, "144",
            "101", "101", "144", "143", "101", "150"),
   uid = c("K-1", "K-1", "K-1", "K-2", "K-2", "K-3",
           "K-4", "K-4", "K-3", "K-5", "K-4", "K-7"))

   mydata <- mydata %>% 
     mutate(Date = ymd(str_remove(Date, " .*")),
            code = as.character(code))
Run Code Online (Sandbox Code Playgroud)

其中第 1 号子句

我使用countfrom dplyr 按城市计算代码。然后case_when根据要求进一步确定“是”或“否”。

# This counts city and code, and fullfills your "Match" column requirement
startdate <- "2017-01-01"
enddate <-   "2020-03-29"
mydata %>% 
  filter(Date >= startdate,
         Date <= enddate) %>%
  count(city, code, name = "count_samecode") %>%   
  mutate(Match = case_when(
   count_samecode > 1 ~ "Yes",
                    T ~ "No")) %>%
  head()
# # A tibble: 6 x 4
#  city     code  count_samecode Match
# <chr>    <chr>          <int> <chr>
# 1 Dubai    101                3 Yes  
# 2 Dubai    150                1 No   
# 3 London   144                2 Yes  
# 4 New York 123                1 No   
# 5 Osaka    193                1 No   
# 6 Tokyo    123                1 No  
Run Code Online (Sandbox Code Playgroud)

其中子句 2

我也会对 UID 做同样的事情

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>%
  head()

# # A tibble: 6 x 3
# city     uid   `UIDs_#_filtered`
# <chr>    <chr>             <int>
# 1 Dubai    K-4                   3
# 2 Dubai    K-7                   1
# 3 London   K-3                   2
# 4 New York K-1                   1
# 5 Osaka    K-2                   1
# 6 Tokyo    K-2                   1
Run Code Online (Sandbox Code Playgroud)

其中第 3 条

我可以重复count第 2 条语句来查找这些城市中有多少个具有不同的 UID,其中 > 1 表示不同的 UID。

mydata %>% 
  filter(Date >= startdate,
         Date <= enddate ) %>% 
  count(city, uid, name = "UIDs_#_filtered") %>% 
  count(city, name = "UIDs_#_different") %>% 
  head()
# # A tibble: 6 x 2
# city     `UIDs_#_different`
# <chr>                 <int>
# 1 Dubai                     2
# 2 London                    1
# 3 New York                  1
# 4 Osaka                     1
# 5 Tokyo                     2
# 6 Utha                      1
Run Code Online (Sandbox Code Playgroud)

其中第 4 条子句

使用 #2 中的相同代码,我可以消除过滤器来查找整个数据集

mydata %>% 
  count(city, uid, name = "UIDs_#_all") %>% 
  head()
Run Code Online (Sandbox Code Playgroud)

把它们放在一起

使用几个left_join' 我们可以更接近您想要的输出。编辑:现在将从第一个城市/代码组合中引入 ID 的第一个实例

check_duplicates_filterview.f <- function( df, startdate, enddate ){
  # df should be a tibble
  # startdate should be a string "yyyy-mm-dd"
  # enddate should be a string   "yyyy-mm-dd"
  cityfilter <- df %>% filter(Date >= startdate,
                              Date <= enddate) %>% distinct(city) %>% pull(1)
  df <- df %>% 
    filter(city %in% cityfilter) %>% 
    mutate(Date = ymd(str_remove(Date, " .*")),
           code = as.character(code))
  entire.db.countcodes <- df %>%              # Finds count of code in entire DB
    count(city, code) 
  where.1 <- df %>% filter(Date >= startdate, 
                     Date <= enddate) %>% 
    distinct(city, code, .keep_all = T) %>%
    left_join(entire.db.countcodes)  %>% 
    rename("count_samecode" = n) %>% 
    mutate(Match = case_when(
      count_samecode > 1 ~ "Yes",
      T ~ "No"))
  
  where.2 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    count(city, uid, name = "UIDs_#_filtered")
  where.3 <- df %>% 
    filter(Date >= startdate,
           Date <= enddate ) %>% 
    distinct(city, uid) %>% 
    count(city, name = "UIDs_#_distinct") 
  where.4 <- df %>% 
    filter(city %in% cityfilter) %>% 
    count(city, uid, name = "UIDs_#_all")
  first_half <- left_join(where.1, where.2)
  second_half <- left_join(where.4, where.3)
  full <- left_join(first_half, second_half)
  return(full)
}


# > check_duplicates_filterview.f(mydata, "2018-01-01", "2020-01-01")
# Joining, by = "city"
# Joining, by = "city"
# Joining, by = c("city", "uid")
# # A tibble: 5 x 8
# city     code  count_samecode Match uid   `UIDs_#_filtered` `UIDs_#_all` `UIDs_#_distinct`
# <chr>    <chr>          <int> <chr> <chr>             <int>        <int>             <int>
# 1 Dubai    101                2 Yes   K-4                   2            3                 1
# 2 London   144                1 No    K-3                   1            2                 1
# 3 New York 123                1 No    K-1                   1            1                 1
# 4 Tokyo    143                1 No    K-5                   1            1                 1
# 5 Utha     103                1 No    K-1                   1            1                 1
Run Code Online (Sandbox Code Playgroud)