A more efficient approach than nested for loops in R: matching

R m*_*tey 1 for-loop r apply

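I am trying to match people who share the same first name, last name, and date of birth (DOB), and keep the lowest ID value for each match.

I created a test data set below (much smaller than my actual data) and wrote a nested for loop that appears to do what it should.

On a larger data set, however, it is painfully slow.

I am fairly new to the apply functions, but they seem more intuitive for applying functions than for data wrangling.

What would be a more efficient alternative to what I am doing here? I am sure there is a simple solution that will have me shaking my head for asking, but it is not coming to me.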

# Test data: Actual_ID holds the ID each row should end up with
dta.test <- data.frame(
  Person_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  FirstName = c("John", "James", "John", "Alex", "Alexander", "Jonathan", "John", "Alex", "James", "John", "John"),
  LastName  = c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith", "Jones", "Smith", "Johnson", "Smith", "Smith"),
  DOB       = c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01", "2006-01-01", "2001-01-01", "2009-01-01"),
  Actual_ID = c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
)

for (i in unique(dta.test$FirstName)) {
  for (j in unique(dta.test$LastName)) {
    for (k in unique(dta.test$DOB)) {
      # Rows matching this (first name, last name, DOB) combination
      rows <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
      dta.test$Person_id[rows] <- min(dta.test$Person_id[rows], na.rm = TRUE)
    }
  }
}

CPa*_*Pak 5

Here is a dplyr solution:

library(dplyr)
dta.test %>%
  group_by(FirstName, LastName, DOB) %>%
  mutate(Person_id = min(Person_id))

# A tibble: 11 x 5
# Groups: FirstName, LastName, DOB [9]
   # Person_id FirstName LastName DOB        Actual_ID
       # <dbl> <fct>     <fct>    <fct>          <dbl>
 # 1        1. John      Smith    2001-01-01        1.
 # 2        2. James     Jones    2002-01-01        2.
 # 3        3. John      Jones    2003-01-01        3.
 # 4        4. Alex      Jones    2004-01-01        4.
 # 5        5. Alexander Jones    2004-01-01        5.
 # 6        6. Jonathan  Smith    2001-01-01        6.
 # 7        3. John      Jones    2003-01-01        3.
 # 8        8. Alex      Smith    2006-01-01        8.
 # 9        9. James     Johnson  2006-01-01        9.
# 10        1. John      Smith    2001-01-01        1.
# 11       11. John      Smith    2009-01-01       11.
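For anyone who would rather not load dplyr, the same grouping can probably be done in base R with `ave()`. This is only a sketch on the question's example data; the `key` variable is a name introduced here for illustration, and `interaction(..., drop = TRUE)` is used so that only name/DOB combinations that actually occur become groups.

```r
# Example data copied from the question
dta.test <- data.frame(
  Person_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11),
  FirstName = c("John", "James", "John", "Alex", "Alexander", "Jonathan",
                "John", "Alex", "James", "John", "John"),
  LastName  = c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith",
                "Jones", "Smith", "Johnson", "Smith", "Smith"),
  DOB       = c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01",
                "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01",
                "2006-01-01", "2001-01-01", "2009-01-01"),
  Actual_ID = c(1, 2, 3, 4, 5, 6, 3, 8, 9, 1, 11)
)

# One grouping factor per observed (FirstName, LastName, DOB) combination;
# `key` is an illustrative name, not something from the original post
key <- interaction(dta.test$FirstName, dta.test$LastName, dta.test$DOB,
                   drop = TRUE)

# ave() returns a vector as long as its input, with FUN applied per group,
# so every row receives the smallest Person_id of its group
dta.test$Person_id <- ave(dta.test$Person_id, key, FUN = min)
```

After this, `Person_id` should agree with the `Actual_ID` column row by row.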

Edit: added a performance comparison

for_loop_approach <- function() {
  for (i in unique(dta.test$FirstName)) {
    for (j in unique(dta.test$LastName)) {
      for (k in unique(dta.test$DOB)) {
        # Rows matching this (first name, last name, DOB) combination
        rows <- dta.test$FirstName == i & dta.test$LastName == j & dta.test$DOB == k
        dta.test$Person_id[rows] <- min(dta.test$Person_id[rows], na.rm = TRUE)
      }
    }
  }
}

dplyr_approach <- function() {
    require(dplyr)
    dta.test %>%
      group_by(FirstName, LastName, DOB) %>%
      mutate(Person_id = min(Person_id))
}

library(microbenchmark)
microbenchmark(for_loop_approach(), dplyr_approach(), unit="relative", times=100L)

Unit: relative
                expr      min      lq    mean   median       uq      max neval
 for_loop_approach() 20.97948 20.6478 18.8189 17.81437 17.91815 11.76743   100
    dplyr_approach()  1.00000  1.0000  1.0000  1.00000  1.00000  1.00000   100
There were 50 or more warnings (use warnings() to see the first 50)
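A likely explanation for both the slowdown and the warnings in the output above: the nested loop visits every combination of unique first name, last name and DOB, including combinations that match no rows at all, and `min()` on an empty vector warns and returns `Inf`. A small sketch on the question's example data (the names `n_combos`, `n_groups` and `empty_min` are illustrative):

```r
# Grouping columns copied from the question's example data
dta.test <- data.frame(
  FirstName = c("John", "James", "John", "Alex", "Alexander", "Jonathan",
                "John", "Alex", "James", "John", "John"),
  LastName  = c("Smith", "Jones", "Jones", "Jones", "Jones", "Smith",
                "Jones", "Smith", "Johnson", "Smith", "Smith"),
  DOB       = c("2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01",
                "2004-01-01", "2001-01-01", "2003-01-01", "2006-01-01",
                "2006-01-01", "2001-01-01", "2009-01-01")
)

# Number of iterations the triple loop performs: the product of the
# unique counts in each column
n_combos <- length(unique(dta.test$FirstName)) *
  length(unique(dta.test$LastName)) *
  length(unique(dta.test$DOB))

# Number of combinations that actually occur in the data
n_groups <- nrow(unique(dta.test[c("FirstName", "LastName", "DOB")]))

# What the loop computes for a combination with no matching rows;
# without suppressWarnings() this call emits a warning
empty_min <- suppressWarnings(min(numeric(0), na.rm = TRUE))

n_combos   # 90
n_groups   # 9
empty_min  # Inf
```

On this toy data the loop runs 90 times for only 9 real groups, and each pass scans all rows three times. A grouped approach only processes the combinations that exist, which is presumably why the gap widens on larger data.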

  • I was just typing this up, +1 (2 upvotes)