我在 R 中有下面提到的数据框:
DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
Run Code Online (Sandbox Code Playgroud)
使用上面提到的两个数据框,我需要以下格式的输出:
我想确定每个唯一 ID的唯一计数Check和Remark每个唯一 ID的唯一计数,以及每个 IDCheck与另一个ID 的组合Check以及与Remark.
注 - 行的顺序应基于Exclusive_Countof由高到低Check。这很可能是独一无二的编号Check,并Remark可能在我的实际数据帧不同。(即 10 个独特的Remark和 5 个Check,像这样)
DF_输出<-
Remark Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 OT LP
Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
X1 1 33.33% 0 2 2 1 A1 1 25.00 0 1 1 0
X2 0 0.00% 2 0 1 0 A2 1 25.00% 1 0 1 1
XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 0 1
XT 1 33.33% 1 0 1 0 LP 0 0.00% 0 1 1 0
Total 3 100.00% 5 4 4 2 Total 4 100.00% 2 3 3 2
Run Code Online (Sandbox Code Playgroud)
OP 已请求规范答案。因此,我创建了一个函数 get_exclusive_counts(),它采用任何 tibble、data.frame 或 data.table 的前两列,其中第一列包含ID,第二列包含有效载荷,例如,Check长格式。
该函数独立于列名,并且可以处理有效负载列中任意数量的不同项目。它为每个输入小标题返回一个 data.table:
get_exclusive_counts(DF)
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)Check Exclusive_Count % A1 A2 LP OT 1: Blank 2 50.00% 0 0 0 0 2: A1 1 25.00% 0 1 0 1 3: A2 1 25.00% 1 0 1 1 4: LP 0 0.00% 0 1 0 1 5: OT 0 0.00% 1 1 1 0 6: Totals 4 100.00% 2 3 2 3
对于第二个用例DF2,需要事先将有效负载拆分为单独的行:
library(magrittr)
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks) %>%
get_exclusive_counts()
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)Remarks Exclusive_Count % X1 X2 XR XT 1: Blank 1 33.33% 0 0 0 0 2: X1 1 33.33% 0 2 2 1 3: XT 1 33.33% 1 0 1 0 4: X2 0 0.00% 2 0 2 0 5: XR 0 0.00% 2 2 0 1 6: Totals 3 100.00% 5 4 5 2
请注意,结果表的第一列的名称已从输入 data.frame 中保留。
的OP已经提到的数目Remarks和Check可能不同。因此,这cbind()对两个结果表没有意义,因为这只会在行数相同的情况下给出合理的结果。
此外,OP 的预期结果有一些重复的列名(至少Exclusive_Count,,%可能更多),这表明结果可能不用于进一步处理,而仅用于显示/打印。
但是,我创建了一个函数get_exclusive_counts_side_by_side()来打印调用的结果get_exclusive_counts()
Totals) 对齐。该函数返回一个带有字符列的 data.table。
下面的调用将重现 OP 的预期结果:
get_exclusive_counts_side_by_side(
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)Remarks Exclusive_Count % X1 X2 XR XT Check Exclusive_Count % A1 A2 LP OT 1: Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0 2: X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 0 1 3: XT 1 33.33% 1 0 1 0 A2 1 25.00% 1 0 1 1 4: X2 0 0.00% 2 0 2 0 LP 0 0.00% 0 1 0 1 5: XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 1 0 6: Totals 3 100.00% 5 4 5 2 Totals 4 100.00% 2 3 2 3
这是另一个用例来证明它可以处理不同的行和任意数量的输入数据集:
get_exclusive_counts_side_by_side(
DF,
DF3 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks),
DF)
Run Code Online (Sandbox Code Playgroud)
Run Code Online (Sandbox Code Playgroud)Check Exclusive_Count % A1 A2 LP OT Remarks Exclusive_Count % X1 X2 XR XT Y2 Y3 Y4 Check Exclusive_Count % A1 A2 LP OT 1: Blank 2 50.00% 0 0 0 0 X1 2 50.00% 0 2 2 1 1 1 0 Blank 2 50.00% 0 0 0 0 2: A1 1 25.00% 0 1 0 1 Blank 1 25.00% 0 0 0 0 0 0 0 A1 1 25.00% 0 1 0 1 3: A2 1 25.00% 1 0 1 1 XT 1 25.00% 1 0 1 0 0 0 0 A2 1 25.00% 1 0 1 1 4: LP 0 0.00% 0 1 0 1 X2 0 0.00% 2 0 2 0 0 0 0 LP 0 0.00% 0 1 0 1 5: OT 0 0.00% 1 1 1 0 XR 0 0.00% 2 2 0 1 0 0 0 OT 0 0.00% 1 1 1 0 6: Y2 0 0.00% 1 0 0 0 0 1 1 7: Y3 0 0.00% 1 0 0 0 1 0 0 8: Y4 0 0.00% 0 0 0 0 1 0 0 9: Totals 4 100.00% 2 3 2 3 Totals 4 100.00% 7 4 5 2 3 2 1 Totals 4 100.00% 2 3 2 3
代码看起来相当庞大,但有一半的行是注释。因此,代码应该是不言自明的。
此外,大约一半的代码行是由于 OP 的附加要求,例如%列或Totals行。
get_exclusive_counts <- function(DF) {
library(data.table)
library(magrittr)
# make copy of first 2 cols to preserve original attributes of DF
DT <- as.data.table(DF[, 1:2])
# retain original column names
old <- colnames(DT)[1:2]
# rename colnames in copy for convenience of programming
setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
# aggregate by id to find exclusive counts = ids with only one element
tmp <- DT[, .N, keyby = id][N == 1L]
# create table of exclusive counts by joining and aggregating
excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>%
# append column of proportions, will be formatted after computing Totals
.[, `%` := Exclusive_Count / sum(Exclusive_Count)]
# anti-join to find remaining rows
rem <- DT[!tmp, on = .(id)]
# create co-occurrence matrix in long format by a self-join
coocc <- rem[rem, on = .(id), allow.cartesian = TRUE] %>%
# reshape to wide format and compute counts of co-occurrences w/o diagonals
dcast(val ~ i.val, length, subset = .(val != i.val))
# build final result table by merging both subresults
merge(excl, coocc, by = "val", all = TRUE) %>%
# replace NA counts by 0
.[, lapply(.SD, nafill, fill = 0L), by = val] %>%
# clean-up: order by decreasing Exclusive_Counts %>%
.[order(-Exclusive_Count)] %>%
# append Totals row
rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>%
# clean-up: format proportion as percentage
.[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>%
# clean-up: Replace <NA> by "Blank" in val column
.[is.na(val), val := "Blank"] %>%
# rename val column
setnames("val", old[2]) %>%
# return result visibly
.[]
}
Run Code Online (Sandbox Code Playgroud)
这是代码get_exclusive_counts_side_by_side():
get_exclusive_counts_side_by_side <- function(...) {
library(data.table)
library(magrittr)
# process input, return list of subresults
ec_list<- list(...) %>%
lapply(get_exclusive_counts)
# create row indices for maximum rows
rid <- ec_list %>%
lapply(nrow) %>%
Reduce(max, .) %>%
{data.table(.rowid = 1:.)}
# combine subresults
ec_list %>%
# insert empty rows if necessary
lapply(function(.x) .x[
, .rowid := .I][
# but align last row
.rowid == .N, .rowid := nrow(rid)][
rid, on =.(.rowid)][
, .rowid := NULL]
) %>%
# all data.tables have the same number of rows, now cbind()
do.call(cbind, .) %>%
# replace all NA by empty character strings
.[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
}
Run Code Online (Sandbox Code Playgroud)
如果我理解正确,独占计数是指仅分配了一个项目(或)的IDNA。这是相当直接的计算
此外,该功能处理 OP 的附加要求,这些要求超出了排他计数的识别:
NA零或"Blank",分别替换s 。DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)
DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
DF3 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}",
"I-8", "{X1,Y2,Y3}",
"I-9", "{Y2,Y4}",
"I10", "{X1}",
)
Run Code Online (Sandbox Code Playgroud)