Find combinations of variables that form a primary key in R

Gee*_*eet 7 r data.table tidyverse

Here is my toy data frame.

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
    "A",   "C",    1L,    5L,  "AA",  "AB",    1L,
    "A",   "C",    2L,    5L,  "BB",  "AC",    2L,
    "A",   "D",    1L,    7L,  "AA",  "BC",    2L,
    "A",   "D",    2L,    3L,  "BB",  "CC",    1L,
    "B",   "C",    1L,    8L,  "AA",  "AB",    1L,
    "B",   "C",    2L,    6L,  "BB",  "AC",    2L,
    "B",   "D",    1L,    9L,  "AA",  "BC",    2L,
    "B",   "D",    2L,    6L,  "BB",  "CC",    1L)

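How can I find the smallest combinations of variables that uniquely identify the observations in the data frame, i.e. which variables together can serve as a primary key?

My approach is to look for combinations of variables whose number of distinct value combinations equals the number of observations in the data frame. In this case, that means combinations that give me 8 distinct rows. I tried a few at random and found some: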

library(dplyr)

df %>% distinct(var1, var2, var3)

df %>% distinct(var1, var2, var5)

df %>% distinct(var1, var3, var7)
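
The manual check above can be wrapped in a small base R helper. This is only a sketch: `is_key` is a hypothetical name, and the toy data is rebuilt as a plain `data.frame` so the snippet runs on its own.

```r
# Same toy data as above, rebuilt with base R so this snippet is self-contained
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L)
)

# Hypothetical helper: TRUE if the columns in `vars` uniquely identify every row
is_key <- function(data, vars) {
  # anyDuplicated() returns 0 when no row of the subset repeats an earlier one
  anyDuplicated(as.data.frame(data)[, vars, drop = FALSE]) == 0L
}

is_key(df, c("var1", "var2", "var3"))  # TRUE
is_key(df, c("var1", "var2"))          # FALSE
```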

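So (var1, var2, var3), (var1, var2, var5), and (var1, var3, var7) each qualify as a primary key here. How can I find these variable combinations programmatically in R? Also, if possible, character, factor, date, and (perhaps) integer variables should be given priority, since doubles should not be part of a primary key.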

The output can be a list or a data frame listing the combinations "var1,var2,var3", "var1,var2,var5", "var1,var3,var7".

the*_*ail 4

This differs a little from the other answers, but it gives the tabular output that was requested:

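# Every combination of column names, from single columns up to all columns
nms <- unlist(lapply(seq_along(df), combn, x = names(df), simplify = FALSE), recursive = FALSE)
# For each combination, count the distinct rows it produces
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)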

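Then take the first combination whose count equals the number of rows. Because the combinations are ordered by size, this is a key with the fewest variables: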

out[match(nrow(df), out$counts),]
#        vars counts
#12 var1,var6      8
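
Since `match()` returns only the first hit, other keys of the same size are not reported. As a rough sketch of one way to list all of them, and to honour the question's preference against doubles, the search can stop at the first size that yields a key and be restricted to a chosen set of columns. `minimal_keys` and `candidates` are hypothetical names, and treating "non-integer numeric" as the columns to exclude is an assumption about what the question intends.

```r
# Same toy data as in the question, rebuilt with base R so the snippet runs on its own
df <- data.frame(
  var1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  var2 = c("C", "C", "D", "D", "C", "C", "D", "D"),
  var3 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L),
  var4 = c(5L, 5L, 7L, 3L, 8L, 6L, 9L, 6L),
  var5 = c("AA", "BB", "AA", "BB", "AA", "BB", "AA", "BB"),
  var6 = c("AB", "AC", "BC", "CC", "AB", "AC", "BC", "CC"),
  var7 = c(1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L)
)

# Hypothetical helper: every smallest-size column combination that is unique per row
minimal_keys <- function(data, candidates = names(data)) {
  for (k in seq_along(candidates)) {
    combos <- combn(candidates, k, simplify = FALSE)
    hits <- Filter(function(v) anyDuplicated(data[, v, drop = FALSE]) == 0L, combos)
    # Stop at the first size that yields a key, so larger combinations are never built
    if (length(hits) > 0L) {
      return(vapply(hits, paste, collapse = ",", FUN.VALUE = character(1)))
    }
  }
  character(0)  # no combination of the candidate columns identifies the rows
}

# Assumption: drop plain double columns; is.numeric() is FALSE for Date and factor,
# so those stay in, as do integer and character columns
is_plain_double <- function(x) is.numeric(x) && !is.integer(x)
candidates <- names(df)[!vapply(df, is_plain_double, logical(1))]

minimal_keys(df, candidates)
# [1] "var1,var6" "var4,var6" "var4,var7"
```

On the toy data no column is a double, so nothing is filtered out, and three two-column keys are found.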