有效地计算跨多个列的字符串的出现次数

Mar*_*ler 3 r dataframe

我有列的大数据帧(> 4万行)yname1,yname2,yname3该存储串:

yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee
Run Code Online (Sandbox Code Playgroud)

现在我想计算所有列中每个字符串的出现总次数.这些应作为附加列添加:

yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa |       3 |       1 |       1
aaabbb | cccccc | aaaaaa |       2 |       2 |       3
aaaaaa | aaabbb | dddddd |       3 |       2 |       2
cccccc | dddddd | eeeeee |       2 |       2 |       1
Run Code Online (Sandbox Code Playgroud)

我已经编写了以下代码,它完成了这项工作:

data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
Run Code Online (Sandbox Code Playgroud)

但是,这确实很慢,需要花费数天才能计算出来.我有什么想法可以加快速度吗?

Jot*_*ota 6

方法怎么样data.table:

library(data.table)
setDT(d)

lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
#    value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1

d[, paste0("rcount", 1:3) :=
   lapply(d, function(x) lookup[x, , on = .(value)][, N])]

#   yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa       3       1       1
#2: aaabbb cccccc aaaaaa       2       2       3
#3: aaaaaa aaabbb dddddd       3       2       2
#4: cccccc dddddd eeeeee       2       2       1
Run Code Online (Sandbox Code Playgroud)

Microbenchmark输出复制来自bgoldst的例子,但有400,000行.

Unit: seconds
          expr       min        lq      mean    median        uq       max neval
   bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697     3  
 alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850     3  
      jota(dt)  2.629337  2.692613  2.719207  2.755889  2.764141  2.772394     3   
    mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702     3
   coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334     3 
  espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306     3
Run Code Online (Sandbox Code Playgroud)


ali*_*ire 6

在基础R中,您可以构建一个包含data.frame的未列出值的表,并按值对其进行索引.确保你索引的是一个字符串,而不是一个因子(因此是as.character),或者它将被数字而不是名称索引.

data.frame(df, 
           lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
#   yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa    3      1      1
# 2 aaabbb cccccc aaaaaa    2      2      3
# 3 aaaaaa aaabbb dddddd    3      2      2
# 4 cccccc dddddd eeeeee    2      2      1
Run Code Online (Sandbox Code Playgroud)

如果data.frame足够大以至于速度很慢,那么您可以在表之外构建表,lapply因此它只运行一次:

df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))
Run Code Online (Sandbox Code Playgroud)

你也可以把它放进去dplyr,这使它更具可读性:

       # look up times repeated
df %>% mutate_each(funs(table(unlist(df))[as.character(.)])) %>%   # or mutate_each(funs(df_table[as.character(.)]))
    # fix column names
    select(rcount = starts_with('yname')) %>% 
    # add original df back in
    bind_cols(df, .)
# Source: local data frame [4 x 6]
# 
#   yname1 yname2 yname3 rcount1 rcount2 rcount3
#   (fctr) (fctr) (fctr)  (tabl)  (tabl)  (tabl)
# 1 aaaaaa bbbaaa bbaaaa       3       1       1
# 2 aaabbb cccccc aaaaaa       2       2       3
# 3 aaaaaa aaabbb dddddd       3       2       2
# 4 cccccc dddddd eeeeee       2       2       1
Run Code Online (Sandbox Code Playgroud)

数据

df <- structure(list(yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"
    ), yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"), yname3 = c("bbaaaa", 
    "aaaaaa", "dddddd", "eeeeee")), .Names = c("yname1", "yname2", 
    "yname3"), row.names = c(NA, -4L), class = "data.frame")
Run Code Online (Sandbox Code Playgroud)


bgo*_*dst 5

已经有一些很好的解决方案,但没有一个用于match()在预先计算的频率表中查找每个字符串.以下是如何做到这一点.请注意,我选择为参数和第一个参数as.matrix()生成yname*列的矩阵.table()match()

cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
##   yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa       3       1       1
## 2 aaabbb cccccc aaaaaa       2       2       3
## 3 aaaaaa aaabbb dddddd       3       2       2
## 4 cccccc dddddd eeeeee       2       2       1
Run Code Online (Sandbox Code Playgroud)

更新:我不敢相信我以前错过了这个,但表达方式

cnts[match(m,names(cnts))]
Run Code Online (Sandbox Code Playgroud)

可以替换为

cnts[m]
Run Code Online (Sandbox Code Playgroud)

所以match()根本没有必要打电话.

我只是重新评估基准测试,发现它并没有以任何显着的方式改变我的解决方案的运行时间(可能只是在小规模测试中略微加速).大概这是因为索引带有字符名称的向量需要match()内部使用相同类型的逻辑,因此上述替换不会获得任何性能.但我会说简洁和简洁的改进是值得的.


标杆

我应该注意到,我对其他一些解决方案进行了一些小的修改,以便产生这些基准测试结果.最值得注意的是,我想避免为重复执行复制任何输入,但由于data.tables通过引用传递,我必须修改jota()以使其成为幂等的.这涉及仅对目标yname*列进行过滤,我将其预先计算为cns通过grep()调用调用的局部变量,就像我在自己的解决方案中一样.为了公平起见,我grep()所有解决方案添加了相同的调用和过滤逻辑,但markus()不需要它,因为它分别显式处理每个列.我也改变了对指数联接操作lookupjota()lookup[.(value=x),,on='value'],因为它不是为我工作,否则.最后,为此mhairi(),我通过Reduce()在所有yname*列中添加对合并的调用来完成解决方案.

library(microbenchmark);
library(data.table);
library(dplyr);

bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
Run Code Online (Sandbox Code Playgroud)
## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);

ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
##           expr      min        lq      mean    median       uq      max neval
##    bgoldst(df)  491.373  544.6165  597.4743  575.8350  609.192 2054.872   100
##     markus(df)  375.907  435.5645  463.7258  467.4250  489.022  549.962   100
##  alistaire(df)  754.380  816.1755  849.8749  840.3385  888.021  959.654   100
##       jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930   100
##     mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045   100
##    coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441   100
Run Code Online (Sandbox Code Playgroud)
## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);

ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE

microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
##           expr        min         lq       mean     median         uq        max neval
##    bgoldst(df)   85.20766   87.00487   88.39154   88.80209   89.98348   91.16487     3
##     markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390     3
##  alistaire(df)   83.03348   83.10276   83.18116   83.17204   83.25500   83.33797     3
##       jota(dt)   12.49174   13.82088   14.44939   15.15002   15.42821   15.70640     3
##     mhairi(df)  156.06459  156.36608  158.27256  156.66758  159.37654  162.08551     3
##    coffein(df)  154.02853  154.97215  156.52246  155.91576  157.76942  159.62309     3
Run Code Online (Sandbox Code Playgroud)