检查唯一值并在 R data.table 中只有一个唯一值时返回它的最快方法

joh*_*nny 9 r data.table

假设我有一个大的data.table,看起来像dt下面。

dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 10)
)
# dt
#    player_1 player_1_age player_2 player_2_age
# 1:        a           10        b           20
# 2:        b           20        a           10
# 3:        b           20        c           30
# 4:        c           30        a           10
Run Code Online (Sandbox Code Playgroud)

根据dt以上内容,我想创建一个data.table具有独特玩家及其年龄的人,如下所示player_dt

# player_dt
# player  age
#      a   10
#      b   20
#      c   30
Run Code Online (Sandbox Code Playgroud)

为此,我尝试了下面的代码,但在我较大的数据集上花费的时间太长,可能是因为我正在data.tablesapply.

在检查每个是否只有一个唯一值时player_dt,您将如何获得上述信息?playerage

# get unique players
player <- sort(unique(c(dt$player_1, dt$player_2)))

# for each player, get their age, if there is only one age value
age <- sapply(player, function(x) {
  unique_values <- unique(c(
    dt[player_1 == x][["player_1_age"]],
    dt[player_2 == x][["player_2_age"]]))
  if(length(unique_values) > 1) stop() else return(unique_values)
})

# combine to create the player_dt
player_dt <- data.table(player, age)
Run Code Online (Sandbox Code Playgroud)

mar*_*kus 7

我使用来自@DavidT 的数据作为输入。

dt
#   player_1 player_1_age player_2 player_2_age
#1:        a           10        b           20
#2:        b           20        a           10
#3:        b           20        c           30
#4:        c           30        a           11 # <--
Run Code Online (Sandbox Code Playgroud)

TL; 博士

你可以做

nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]

out <-
  unique(melt(
    dt,
    measure.vars = list(colsAge, colsOther),
    value.name = c("age", "player")
  )[, .(age, player)])[, if (.N == 1) # credit: /sf/answers/2409956111/
    .SD, by = player]
out
#   player age
#1:      b  20
#2:      c  30
Run Code Online (Sandbox Code Playgroud)

一步步

您可以做的是同时熔化多个列 - 以结尾的列"age"和不以结尾的列。

nm <- names(dt)
idx <- endsWith(nm, "age")
colsAge <- nm[idx]
colsOther <- nm[!idx]
dt1 <- melt(dt, measure.vars = list(colsAge, colsOther), value.name = c("age", "player"))
Run Code Online (Sandbox Code Playgroud)

结果是

dt1
#   variable age player
#1:        1  10      a
#2:        1  20      b
#3:        1  20      b
#4:        1  30      c
#5:        2  20      b
#6:        2  10      a
#7:        2  30      c
#8:        2  11      a
Run Code Online (Sandbox Code Playgroud)

现在我们叫unique...

out <- unique(dt1[, .(age, player)])
out
#   age player
#1:  10      a
#2:  20      b
#3:  30      c
#4:  11      a
Run Code Online (Sandbox Code Playgroud)

...并过滤"player"长度等于 1 的组

out <- out[, if(.N == 1) .SD, by=player]
out
#   player age
#1:      b  20
#2:      c  30
Run Code Online (Sandbox Code Playgroud)

鉴于 OP 的输入数据,不需要最后一步。

数据

library(data.table)
dt <- data.table(
  player_1 = c("a", "b", "b", "c"),
  player_1_age = c(10, 20, 20, 30),
  player_2 = c("b", "a", "c", "a"),
  player_2_age = c(20, 10, 30, 11)
)
Run Code Online (Sandbox Code Playgroud)

参考:https : //cran.r-project.org/web/packages/data.table/vignettes/datatable-reshape.html