How to create a distribution matrix in R

Sop*_*son 5 r dataframe dplyr tidyverse

I have the following data frame in R.

ID         Date                  List             Type
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      TR1
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      RES
P-10012    2020-04-15 12:13:15   ABC,ABD,BCD      FTT
P-10013    2020-04-12 17:10:05                    TR1
P-10013    2020-04-12 17:10:05                    FTT
P-10013    2020-04-12 17:10:05                    ZXR
P-10014    2020-04-10 04:30:19   ABD,BCD          TR1
P-10014    2020-04-10 04:30:19   ABD,BCD          ZXR
P-10015    2020-04-10 14:13:15   ABC              
P-10016    2020-04-10 13:13:15   
P-10017    2020-03-18 10:13:15   ABC,ABD,BCD      TR1



dput(df)
df <- structure(list(ID = c("P-10012", "P-10012", 
"P-10012", "P-10013", "P-10013", "P-10013", 
"P-10014", "P-10014", "P-10015", "P-10016", 
"P-10017"), Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", 
"2020-04-15 12:13:15", "2020-04-12 17:10:05", "2020-04-12 17:10:05", 
"2020-04-12 17:10:05", "2020-04-10 04:30:19", "2020-04-10 04:30:19", 
"2020-04-10 14:13:15", "2020-04-10 13:13:15", "2020-03-18 10:13:15"
), Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"), List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", 
"", "", "", "ABD,BCD", "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), class = "data.frame", row.names = c(NA, 
-11L))

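The data frame is structured so that a given ID always has the same value of List. An ID spans multiple rows only when it has multiple distinct values of Type; if an ID has just one value of Type, it always has a single row.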

I need to create a month-wise distribution, shown below for Apr-20, of the List values (which are stored comma-separated) and the Type values.

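The first 7 rows of my Required Df are the distinct count of ID under each condition (i.e. whether List or Type is blank or not), broken down across all unique List and Type values. For these 7 rows, Distinct_Count should be divided by Total to get Percentage. From row 8 onwards, however, if the unique value comes from List it should be divided by the total distinct count of Non_Blank_List, and if it comes from Type it should be divided by the total distinct count of Non_Blank_Type.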
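The percentage rule above can be sketched in a few lines of base R. This is only an illustration: the three denominators are copied from the required output in this question (Total_ID = 5, Non_Blank_List = 3, Non_Blank_Type = 3), and the helper name `pct` is hypothetical.

```r
# Denominators copied from the required output for April 2020
total_id       <- 5  # distinct IDs in the month
non_blank_list <- 3  # distinct IDs with a non-blank List
non_blank_type <- 3  # distinct IDs with a non-blank Type

# Hypothetical helper: format count / denom as a percentage with two decimals
pct <- function(count, denom) sprintf("%.2f%%", 100 * count / denom)

pct(2, total_id)        # a summary row such as Blank_List: 2 of 5 IDs -> "40.00%"
pct(1, non_blank_list)  # a List value such as ABC: 1 of 3 IDs        -> "33.33%"
pct(0, non_blank_type)  # a Type value such as TR1: 0 of 3 IDs        -> "0.00%"
```

The only thing that changes between the first 7 rows and the later rows is which denominator is passed in.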

In the matrix below I just want to understand how the unique values of List and Type are distributed in combination with the other values.

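Note that for the sake of the example I have reduced List and Type to 3 and 4 unique values respectively. In my actual data frame the number is much higher and varies from month to month, so please don't hard-code those values.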

I have tried several approaches but have not been able to produce the required output.

Required Df<-

APR-20           Distinct_Count    Percentage    ABC     ABD      BCD     TR1    RES    FTT     ZXR
Total_ID         5                 100.00%       2       2        2       3      1      2       2 
Blank_List       2                 40.00%        0       0        0       1      0      1       1
Blank_Type       2                 40.00%        1       0        0       0      0      0       0
Both_Blank       1                 20.00%        0       0        0       0      0      0       0
Non_Blank_List   3                 60.00%        2       2        2       2      1      1       1         
Non_Blank_Type   3                 60.00%        1       2        2       3      1      1       2
Both_Non_Blank   2                 40.00%        1       2        2       2      1      1       1
ABC              1                 33.33%        2       1        1       1      1      1       0
ABD              0                  0.00%        1       2        2       2      1      1       1
BCD              0                  0.00%        1       2        2       2      1      1       1
TR1              0                  0.00%        1       2        2       3      1      1       1 
RES              0                  0.00%        1       1        1       1      1      1       0    
FTT              0                  0.00%        1       1        1       2      1      2       1 
ZXR              0                  0.00%        0       1        1       1      0      1       2

Dan*_*iel 3

The biggest challenge is that the keys appear both as rows and as columns.

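I used 2 custom functions to count the occurrences: getcount() for the summary rows (Total_ID, Blank_List, ...) and getcount2() for the case rows (ABC, BCD, TR1, ...).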

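Both work in essentially the same way. We compute the count for each individual case and the total separately, always grouping by ID, and then concatenate the results.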
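As a stripped-down illustration of that pattern (group by ID, reduce each group with any(), then sum), here is a base R sketch. It is not the answer's code: it uses tapply() instead of dplyr, matches Type with %in% instead of str_detect(), and the names `df_apr` and `count_ids` are hypothetical.

```r
# April rows of the example data (P-10017 is dated March, so it is left out)
df_apr <- data.frame(
  ID   = c("P-10012", "P-10012", "P-10012", "P-10013", "P-10013", "P-10013",
           "P-10014", "P-10014", "P-10015", "P-10016"),
  Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA),
  List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", "", "", "",
           "ABD,BCD", "ABD,BCD", "ABC", ""),
  stringsAsFactors = FALSE
)

# Same utility as in the answer: NA and "" both count as blank
is_blank <- function(x) is.na(x) | x == ""

# Hypothetical helper: number of distinct IDs having at least one row
# where `cond` is TRUE
count_ids <- function(cond, id) sum(tapply(cond, id, any), na.rm = TRUE)

count_ids(is_blank(df_apr$List), df_apr$ID)    # Blank_List     -> 2
count_ids(!is_blank(df_apr$Type), df_apr$ID)   # Non_Blank_Type -> 3

# One cell of the matrix: Blank_List IDs that also carry Type TR1 -> 1
count_ids(is_blank(df_apr$List) & df_apr$Type %in% "TR1", df_apr$ID)
```

Each cell of the final matrix is a count of this kind, with a different condition for each row and column pair.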

Here is the code:

library(tidyverse)
library(lubridate)

df <- structure(list(ID = c("P-10012", "P-10012", "P-10012", "P-10013", "P-10013", "P-10013", 
                            "P-10014", "P-10014", "P-10015", "P-10016", "P-10017"), 
                     Date = c("2020-04-15 12:13:15", "2020-04-15 12:13:15", "2020-04-15 12:13:15", 
                              "2020-04-12 17:10:05", "2020-04-12 17:10:05", "2020-04-12 17:10:05", 
                              "2020-04-10 04:30:19", "2020-04-10 04:30:19", "2020-04-10 14:13:15", 
                              "2020-04-10 13:13:15", "2020-03-18 10:13:15"), 
                     Type = c("TR1", "RES", "FTT", "TR1", "FTT", "ZXR", "TR1", "ZXR", NA, NA, "TR1"),
                     List = c("ABC,ABD,BCD", "ABC,ABD,BCD", "ABC,ABD,BCD", "", "", "", "ABD,BCD", 
                              "ABD,BCD", "ABC", "", "ABC,ABD,BCD")), 
                class = "data.frame", row.names = c(NA, -11L))

#extract all the individual values from Type and List
cases = c(df$Type, str_split(df$List, ", ?", simplify=TRUE)) %>% unique() %>% 
  sort() %>% .[.!=""] %>% rlang::set_names()

#util function
is_blank = function(x) is.na(x) | x==""

#get count for summary rows (TotalID, Blank_list, ...)
getcount = function(cond){
  x = map_dbl(cases, ~df %>% 
                filter(month(Date)==4) %>%
                group_by(ID) %>% 
                summarise(rtn=any({{cond}} & (str_detect(Type, .x) | str_detect(List, .x)))) %>% 
                pull() %>% sum(na.rm=TRUE)
  )
  x_tot = df %>% 
    filter(month(Date)==4) %>% 
    group_by(ID) %>% 
    summarise(rtn=any({{cond}})) %>% 
    pull() %>% sum(na.rm=TRUE)
  
  c(x_tot, x)
}

#get count for cases rows (ABC, BCD, TR1...)
getcount2 = function(key){
  x = map_dbl(cases, ~df %>% 
                filter(month(Date)==4) %>%
                group_by(ID) %>% 
                summarise(rtn=any(
                  (key %in% Type  | str_detect(List, key)) &
                    (str_detect(Type, .x ) | str_detect(List, .x ))
                )) %>% 
                pull() %>% sum(na.rm=TRUE)
  )
  x_tot = df %>% 
    filter(month(Date)==4) %>% 
    group_by(ID) %>% 
    summarise(rtn=any(List==key)) %>% 
    pull() %>% sum(na.rm=TRUE)
  
  c(tot=x_tot, x)
}


#here we go!
tibble(x=c("Distinct_Count", cases)) %>% 
  mutate(
    Total_ID=getcount(TRUE),
    Blank_List=getcount(is_blank(List)),
    Blank_Type=getcount(is_blank(Type)),
    Blank_Both=getcount(is_blank(List) & is_blank(Type)),
    Non_Blank_List=getcount(!is_blank(List)),
    Non_Blank_Type=getcount(!is_blank(Type)),
    Non_Blank_Both=getcount(!is_blank(List) & !is_blank(Type))
  ) %>% 
  bind_cols(map_dfc(cases, ~getcount2(.x))) %>% 
  column_to_rownames("x") %>% 
  t() %>% as.data.frame() %>% 
  mutate(Percentage = scales::percent(Distinct_Count/max(Distinct_Count)), .after="Distinct_Count")
#>                Distinct_Count Percentage ABC ABD BCD FTT RES TR1 ZXR
#> Total_ID                    5       100%   2   2   2   2   1   3   2
#> Blank_List                  2        40%   0   0   0   1   0   1   1
#> Blank_Type                  2        40%   1   0   0   0   0   0   0
#> Blank_Both                  1        20%   0   0   0   0   0   0   0
#> Non_Blank_List              3        60%   2   2   2   1   1   2   1
#> Non_Blank_Type              3        60%   1   2   2   2   1   3   2
#> Non_Blank_Both              2        40%   1   2   2   1   1   2   1
#> ABC                         1        20%   2   1   1   1   1   1   0
#> ABD                         0         0%   1   2   2   1   1   2   1
#> BCD                         0         0%   1   2   2   1   1   2   1
#> FTT                         0         0%   1   1   1   2   1   2   1
#> RES                         0         0%   1   1   1   1   1   1   0
#> TR1                         0         0%   1   2   2   2   1   3   2
#> ZXR                         0         0%   0   1   1   1   0   2   2

Created on 2021-05-12 by the reprex package (v2.0.0)

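Note that there are a few slight differences from your expected output, but I think they are just small mistakes in your rather complex example. For instance, ABC has Distinct_Count==1, so out of 5 it should not come to 33%. Also, ZXR can occur together with TR1 twice (ID 13 and 14).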
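The ZXR/TR1 remark can be checked by hand with a short base R sketch. The list below simply restates the Type values per ID from the April rows of the example data (P-10015 and P-10016 have no Type and are omitted); the names `type_by_id` and `both` are hypothetical.

```r
# Type values per ID for the April rows of the example data
type_by_id <- list(
  "P-10012" = c("TR1", "RES", "FTT"),
  "P-10013" = c("TR1", "FTT", "ZXR"),
  "P-10014" = c("TR1", "ZXR")
)

# IDs whose Type values include both ZXR and TR1
both <- names(Filter(function(types) all(c("ZXR", "TR1") %in% types), type_by_id))

both          # "P-10013" "P-10014"
length(both)  # 2
```

This gives 2 for the ZXR/TR1 pair, which is the value in the ZXR row, TR1 column of the output above.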