如何在R数据帧中获得独占计数

use*_*845 3 r dataframe

我在 R 中有下面提到的数据框:

DF <- tibble::tribble(
    ~ID, ~Check,
  "I-1",   "A1",
  "I-2",   "A2",
  "I-2",   "OT",
  "I-2",   "LP",
  "I-3",   "A1",
  "I-3",   "A2",
  "I-4",     NA,
  "I-5",     NA,
  "I-6",   "A1",
  "I-6",   "OT",
  "I-7",   "A2"
  )

DF2 <- tibble::tribble(
    ~ID,     ~Remarks,
  "I-1", "{X1,XR,XT}",
  "I-2",    "{X2,XR}",
  "I-3",           NA,
  "I-4", "{X1,XR,X2}",
  "I-5",       "{X1}",
  "I-6",       "{XT}",
  "I-7",    "{X1,X2}"
  )
Run Code Online (Sandbox Code Playgroud)

使用上面提到的两个数据框,我需要以下格式的输出:

我想确定每个唯一 ID的唯一计数CheckRemark每个唯一 ID的唯一计数,以及每个 IDCheck与另一个ID 的组合Check以及与Remark.

注 - 行的顺序应基于Exclusive_Countof由高到低Check。这很可能是独一无二的编号Check,并Remark可能在我的实际数据帧不同。(即 10 个独特的Remark和 5 个Check,像这样)

DF_输出<-

Remark   Exclusive_Count  %       X1  X2  XR  XT  Check  Exclusive_Count  %          A1  A2  OT  LP
Blank    1               33.33%   0   0   0   0   Blank  2                50.00%     0   0   0   0
X1       1               33.33%   0   2   2   1   A1     1                25.00      0   1   1   0
X2       0               0.00%    2   0   1   0   A2     1                25.00%     1   0   1   1
XR       0               0.00%    2   2   0   1   OT     0                0.00%      1   1   0   1
XT       1               33.33%   1   0   1   0   LP     0                0.00%      0   1   1   0
Total    3               100.00%  5   4   4   2   Total  4                100.00%    2   3   3   2
                                               
Run Code Online (Sandbox Code Playgroud)

Uwe*_*Uwe 6

OP 已请求规范答案。因此,我创建了一个函数 get_exclusive_counts(),它采用任何 tibble、data.frame 或 data.table 的前两列,其中第一列包含ID,第二列包含有效载荷,例如,Check长格式。

该函数独立于列名,并且可以处理有效负载列中任意数量的不同项目。它为每个输入小标题返回一个 data.table:

get_exclusive_counts(DF)
Run Code Online (Sandbox Code Playgroud)
    Check Exclusive_Count       % A1 A2 LP OT
1:  Blank               2  50.00%  0  0  0  0
2:     A1               1  25.00%  0  1  0  1
3:     A2               1  25.00%  1  0  1  1
4:     LP               0   0.00%  0  1  0  1
5:     OT               0   0.00%  1  1  1  0
6: Totals               4 100.00%  2  3  2  3
Run Code Online (Sandbox Code Playgroud)

对于第二个用例DF2,需要事先将有效负载拆分为单独的行:

library(magrittr)
DF2 %>% 
  dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>% 
  tidyr::separate_rows(Remarks) %>% 
  get_exclusive_counts() 
Run Code Online (Sandbox Code Playgroud)
   Remarks Exclusive_Count       % X1 X2 XR XT
1:   Blank               1  33.33%  0  0  0  0
2:      X1               1  33.33%  0  2  2  1
3:      XT               1  33.33%  1  0  1  0
4:      X2               0   0.00%  2  0  2  0
5:      XR               0   0.00%  2  2  0  1
6:  Totals               3 100.00%  5  4  5  2
Run Code Online (Sandbox Code Playgroud)

请注意,结果表的第一列的名称已从输入 data.frame 中保留。

的OP已经提到的数目RemarksCheck可能不同。因此,这cbind()对两个结果表没有意义,因为这只会在行数相同的情况下给出合理的结果。

此外,OP 的预期结果有一些重复的列名(至少Exclusive_Count,,%可能更多),这表明结果可能不用于进一步处理,而仅用于显示/打印。

并排打印结果

但是,我创建了一个函数get_exclusive_counts_side_by_side()来打印调用的结果get_exclusive_counts()

  • 对于任意数量的输入数据集,
  • 具有不同的行数,以及
  • 最后几行 ( Totals) 对齐。

该函数返回一个带有字符列的 data.table。

下面的调用将重现 OP 的预期结果:

get_exclusive_counts_side_by_side(
  DF2 %>% 
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>% 
    tidyr::separate_rows(Remarks),
  DF)
Run Code Online (Sandbox Code Playgroud)
   Remarks Exclusive_Count       % X1 X2 XR XT  Check Exclusive_Count       % A1 A2 LP OT
1:   Blank               1  33.33%  0  0  0  0  Blank               2  50.00%  0  0  0  0
2:      X1               1  33.33%  0  2  2  1     A1               1  25.00%  0  1  0  1
3:      XT               1  33.33%  1  0  1  0     A2               1  25.00%  1  0  1  1
4:      X2               0   0.00%  2  0  2  0     LP               0   0.00%  0  1  0  1
5:      XR               0   0.00%  2  2  0  1     OT               0   0.00%  1  1  1  0
6:  Totals               3 100.00%  5  4  5  2 Totals               4 100.00%  2  3  2  3
Run Code Online (Sandbox Code Playgroud)

这是另一个用例来证明它可以处理不同的行和任意数量的输入数据集:

get_exclusive_counts_side_by_side(
  DF, 
  DF3 %>% 
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>% 
    tidyr::separate_rows(Remarks),
  DF)
Run Code Online (Sandbox Code Playgroud)
    Check Exclusive_Count       % A1 A2 LP OT Remarks Exclusive_Count       % X1 X2 XR XT Y2 Y3 Y4  Check Exclusive_Count       % A1 A2 LP OT
1:  Blank               2  50.00%  0  0  0  0      X1               2  50.00%  0  2  2  1  1  1  0  Blank               2  50.00%  0  0  0  0
2:     A1               1  25.00%  0  1  0  1   Blank               1  25.00%  0  0  0  0  0  0  0     A1               1  25.00%  0  1  0  1
3:     A2               1  25.00%  1  0  1  1      XT               1  25.00%  1  0  1  0  0  0  0     A2               1  25.00%  1  0  1  1
4:     LP               0   0.00%  0  1  0  1      X2               0   0.00%  2  0  2  0  0  0  0     LP               0   0.00%  0  1  0  1
5:     OT               0   0.00%  1  1  1  0      XR               0   0.00%  2  2  0  1  0  0  0     OT               0   0.00%  1  1  1  0
6:                                                 Y2               0   0.00%  1  0  0  0  0  1  1                                         
7:                                                 Y3               0   0.00%  1  0  0  0  1  0  0                                         
8:                                                 Y4               0   0.00%  0  0  0  0  1  0  0                                         
9: Totals               4 100.00%  2  3  2  3  Totals               4 100.00%  7  4  5  2  3  2  1 Totals               4 100.00%  2  3  2  3
Run Code Online (Sandbox Code Playgroud)

功能定义

代码看起来相当庞大,但有一半的行是注释。因此,代码应该是不言自明的。

此外,大约一半的代码行是由于 OP 的附加要求,例如%列或Totals行。

get_exclusive_counts <- function(DF) {
  library(data.table)
  library(magrittr)
  # make copy of first 2 cols to preserve original attributes of DF
  DT <- as.data.table(DF[, 1:2])
  # retain original column names
  old <- colnames(DT)[1:2]
  # rename colnames in copy for convenience of programming
  setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
  # aggregate by id to find exclusive counts = ids with only one element
  tmp <- DT[, .N, keyby = id][N == 1L]
  # create table of exclusive counts by joining and aggregating
  excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>% 
    # append column of proportions, will be formatted after computing Totals
    .[, `%` := Exclusive_Count / sum(Exclusive_Count)]
  # anti-join to find remaining rows
  rem <- DT[!tmp, on = .(id)]
  # create co-occurrence matrix in long format by a self-join
  coocc <-   rem[rem, on = .(id), allow.cartesian = TRUE] %>% 
    # reshape to wide format and compute counts of co-occurrences w/o diagonals
    dcast(val ~ i.val, length, subset = .(val != i.val))
  # build final result table by merging both subresults
  merge(excl, coocc, by = "val", all = TRUE) %>% 
    # replace NA counts by 0 
    .[, lapply(.SD, nafill, fill = 0L), by = val] %>% 
    # clean-up: order by decreasing Exclusive_Counts %>% 
    .[order(-Exclusive_Count)] %>% 
    # append Totals row
    rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>% 
    # clean-up: format proportion as percentage
    .[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>% 
    # clean-up: Replace <NA> by "Blank" in val column
    .[is.na(val), val := "Blank"] %>%
    # rename val column
    setnames("val", old[2]) %>% 
    # return result visibly
    .[]
}
Run Code Online (Sandbox Code Playgroud)

这是代码get_exclusive_counts_side_by_side()

get_exclusive_counts_side_by_side <- function(...) {
  library(data.table)
  library(magrittr)
  # process input, return list of subresults
  ec_list<- list(...) %>% 
    lapply(get_exclusive_counts)
  # create row indices for maximum rows
  rid <- ec_list %>% 
    lapply(nrow) %>%
    Reduce(max, .) %>% 
    {data.table(.rowid = 1:.)}
  # combine subresults 
  ec_list %>% 
    # insert empty rows if necessary
    lapply(function(.x) .x[
      , .rowid := .I][
        # but align last row
        .rowid == .N, .rowid := nrow(rid)][
          rid, on =.(.rowid)][
            , .rowid := NULL]
    ) %>%  
    # all data.tables have the same number of rows, now cbind()
    do.call(cbind, .) %>% 
    # replace all NA by empty character strings
    .[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}
Run Code Online (Sandbox Code Playgroud)

补充说明

如果我理解正确,独占计数是指仅分配了一个项目(或)的IDNA。这是相当直接的计算

  1. 计算每个ID的项目数,
  2. 选择只有一个项目的ID
  3. 在输入 data.frame 中选择属于这些ID 的行(使用连接),以及
  4. 计算独占行子集中项目的出现次数。

此外,该功能处理 OP 的附加要求,这些要求超出了排他计数的识别:

  • 添加剩余非排他行的共现计数矩阵,
  • 在特定位置添加一列排他计数的比例并将其格式化为百分比,
  • 添加总计行,
  • NA零或"Blank",分别替换s 。

数据

DF <- tibble::tribble(
  ~ID, ~Check,
  "I-1",   "A1",
  "I-2",   "A2",
  "I-2",   "OT",
  "I-2",   "LP",
  "I-3",   "A1",
  "I-3",   "A2",
  "I-4",     NA,
  "I-5",     NA,
  "I-6",   "A1",
  "I-6",   "OT",
  "I-7",   "A2"
)

DF2 <- tibble::tribble(
  ~ID,     ~Remarks,
  "I-1", "{X1,XR,XT}",
  "I-2",    "{X2,XR}",
  "I-3",           NA,
  "I-4", "{X1,XR,X2}",
  "I-5",       "{X1}",
  "I-6",       "{XT}",
  "I-7",    "{X1,X2}"
)

DF3 <- tibble::tribble(
  ~ID,     ~Remarks,
  "I-1", "{X1,XR,XT}",
  "I-2",    "{X2,XR}",
  "I-3",           NA,
  "I-4", "{X1,XR,X2}",
  "I-5",       "{X1}",
  "I-6",       "{XT}",
  "I-7",    "{X1,X2}",
  "I-8", "{X1,Y2,Y3}",
  "I-9",    "{Y2,Y4}",
  "I10",       "{X1}",
)
Run Code Online (Sandbox Code Playgroud)