我有列的大数据帧(> 4万行)yname1,yname2,yname3该存储串:
yname1 | yname2 | yname3
aaaaaa | bbbaaa | bbaaaa
aaabbb | cccccc | aaaaaa
aaaaaa | aaabbb | dddddd
cccccc | dddddd | eeeeee
Run Code Online (Sandbox Code Playgroud)
现在我想计算所有列中每个字符串的出现总次数.这些应作为附加列添加:
yname1 | yname2 | yname3 | rcount1 | rcount2 | rcount3
aaaaaa | bbbaaa | bbaaaa | 3 | 1 | 1
aaabbb | cccccc | aaaaaa | 2 | 2 | 3
aaaaaa | aaabbb | dddddd | 3 | 2 | 2
cccccc | dddddd | eeeeee | 2 | 2 | 1
Run Code Online (Sandbox Code Playgroud)
我已经编写了以下代码,它完成了这项工作:
data3$rcount1 <- sapply(data3$yname1, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount2 <- sapply(data3$yname2, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
data3$rcount3 <- sapply(data3$yname3, function(x) sum(data2$yname1==x)+sum(data2$yname2==x)+sum(data2$yname3==x))
Run Code Online (Sandbox Code Playgroud)
但是,这确实很慢,需要花费数天才能计算出来.我有什么想法可以加快速度吗?
方法怎么样data.table:
library(data.table)
setDT(d)
lookup <- melt(d, measure.vars = paste0("yname", 1:3))[, .N, by = value]
# value N
#1: aaaaaa 3
#2: aaabbb 2
#3: cccccc 2
#4: bbbaaa 1
#5: dddddd 2
#6: bbaaaa 1
#7: eeeeee 1
d[, paste0("rcount", 1:3) :=
lapply(d, function(x) lookup[x, , on = .(value)][, N])]
# yname1 yname2 yname3 rcount1 rcount2 rcount3
#1: aaaaaa bbbaaa bbaaaa 3 1 1
#2: aaabbb cccccc aaaaaa 2 2 3
#3: aaaaaa aaabbb dddddd 3 2 2
#4: cccccc dddddd eeeeee 2 2 1
Run Code Online (Sandbox Code Playgroud)
Microbenchmark输出复制来自bgoldst的例子,但有400,000行.
Unit: seconds
expr min lq mean median uq max neval
bgoldst(df) 21.445961 21.628228 21.876051 21.810496 22.091096 22.371697 3
alistaire(df) 20.685357 20.961761 21.255457 21.238164 21.540507 21.842850 3
jota(dt) 2.629337 2.692613 2.719207 2.755889 2.764141 2.772394 3
mhairi(df) 40.780441 41.048345 41.669798 41.316249 42.114476 42.912702 3
coffein(df) 35.669630 35.678719 36.453257 35.687808 36.845071 38.002334 3
espresso(df) 20.823840 20.976175 21.317218 21.128509 21.563907 21.999306 3
Run Code Online (Sandbox Code Playgroud)
在基础R中,您可以构建一个包含data.frame的未列出值的表,并按值对其进行索引.确保你索引的是一个字符串,而不是一个因子(因此是as.character),或者它将被数字而不是名称索引.
data.frame(df,
lapply(df, function(x){data.frame(table(unlist(df))[as.character(x)])['Freq']})
)
# yname1 yname2 yname3 Freq Freq.1 Freq.2
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
Run Code Online (Sandbox Code Playgroud)
如果data.frame足够大以至于速度很慢,那么您可以在表之外构建表,lapply因此它只运行一次:
df_table <- table(unlist(df))
data.frame(df, lapply(df, function(x){data.frame(df_table[as.character(x)])['Freq']}))
Run Code Online (Sandbox Code Playgroud)
你也可以把它放进去dplyr,这使它更具可读性:
# look up times repeated
df %>% mutate_each(funs(table(unlist(df))[as.character(.)])) %>% # or mutate_each(funs(df_table[as.character(.)]))
# fix column names
select(rcount = starts_with('yname')) %>%
# add original df back in
bind_cols(df, .)
# Source: local data frame [4 x 6]
#
# yname1 yname2 yname3 rcount1 rcount2 rcount3
# (fctr) (fctr) (fctr) (tabl) (tabl) (tabl)
# 1 aaaaaa bbbaaa bbaaaa 3 1 1
# 2 aaabbb cccccc aaaaaa 2 2 3
# 3 aaaaaa aaabbb dddddd 3 2 2
# 4 cccccc dddddd eeeeee 2 2 1
Run Code Online (Sandbox Code Playgroud)
df <- structure(list(yname1 = c("aaaaaa", "aaabbb", "aaaaaa", "cccccc"
), yname2 = c("bbbaaa", "cccccc", "aaabbb", "dddddd"), yname3 = c("bbaaaa",
"aaaaaa", "dddddd", "eeeeee")), .Names = c("yname1", "yname2",
"yname3"), row.names = c(NA, -4L), class = "data.frame")
Run Code Online (Sandbox Code Playgroud)
已经有一些很好的解决方案,但没有一个用于match()在预先计算的频率表中查找每个字符串.以下是如何做到这一点.请注意,我选择为参数和第一个参数as.matrix()生成yname*列的矩阵.table()match()
cns <- grep(value=T,'^yname',names(df));
m <- as.matrix(df[cns]);
cnts <- table(m);
df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df));
df;
## yname1 yname2 yname3 rcount1 rcount2 rcount3
## 1 aaaaaa bbbaaa bbaaaa 3 1 1
## 2 aaabbb cccccc aaaaaa 2 2 3
## 3 aaaaaa aaabbb dddddd 3 2 2
## 4 cccccc dddddd eeeeee 2 2 1
Run Code Online (Sandbox Code Playgroud)
更新:我不敢相信我以前错过了这个,但表达方式
cnts[match(m,names(cnts))]
Run Code Online (Sandbox Code Playgroud)
可以替换为
cnts[m]
Run Code Online (Sandbox Code Playgroud)
所以match()根本没有必要打电话.
我只是重新评估基准测试,发现它并没有以任何显着的方式改变我的解决方案的运行时间(可能只是在小规模测试中略微加速).大概这是因为索引带有字符名称的向量需要match()内部使用相同类型的逻辑,因此上述替换不会获得任何性能.但我会说简洁和简洁的改进是值得的.
我应该注意到,我对其他一些解决方案进行了一些小的修改,以便产生这些基准测试结果.最值得注意的是,我想避免为重复执行复制任何输入,但由于data.tables通过引用传递,我必须修改jota()以使其成为幂等的.这涉及仅对目标yname*列进行过滤,我将其预先计算为cns通过grep()调用调用的局部变量,就像我在自己的解决方案中一样.为了公平起见,我grep()向所有解决方案添加了相同的调用和过滤逻辑,但markus()不需要它,因为它分别显式处理每个列.我也改变了对指数联接操作lookup中jota()到lookup[.(value=x),,on='value'],因为它不是为我工作,否则.最后,为此mhairi(),我通过Reduce()在所有yname*列中添加对合并的调用来完成解决方案.
library(microbenchmark);
library(data.table);
library(dplyr);
bgoldst <- function(df) { cns <- grep(value=T,'^yname',names(df)); m <- as.matrix(df[cns]); cnts <- table(m); df[,paste0('rcount',seq_along(cns))] <- matrix(cnts[match(m,names(cnts))],nrow(df)); df; };
markus <- function(df) { df$rcount1 <- sapply(df$yname1, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount2 <- sapply(df$yname2, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df$rcount3 <- sapply(df$yname3, function(x) sum(df$yname1==x)+sum(df$yname2==x)+sum(df$yname3==x)); df; };
alistaire <- function(df) { cns <- grep(value=T,'^yname',names(df)); df_table <- table(unlist(df[cns])); data.frame(df[cns],lapply(df[cns],function(x){data.frame(Freq=df_table[as.character(x)])})); };
jota <- function(dt) { cns <- grep(value=T,'^yname',names(df)); lookup <- melt(dt, measure.vars = cns)[, .N, by = value]; dt[, paste0("rcount", 1:3) := lapply(dt[,cns,with=F], function(x) lookup[.(value=x), , on = 'value'][, N])]; };
mhairi <- function(df) { cns <- grep(value=T,'^yname',names(df)); all_yname <-do.call(c,df[cns]); rcount <- as.data.frame(table(all_yname)); Reduce(function(df,cn) merge(df, rcount, by.x = cn, by.y = 'all_yname'),cns,df); };
coffein <- function(df) { cns <- grep(value=T,'^yname',names(df)); df2 <- melt(df[cns], id.vars = NULL); df2 <- df2 %>% group_by(value) %>% summarise(n=n()) %>% as.data.frame(); rownames(df2) <- df2$value; df2$value <- NULL; df$r1 <- df2[df$yname1,]; df$r2 <- df2[df$yname2,]; df$r3 <- df2[df$yname3,]; df; };
Run Code Online (Sandbox Code Playgroud)
## OP's test case
df <- data.frame(yname1=c('aaaaaa','aaabbb','aaaaaa','cccccc'),yname2=c('bbbaaa','cccccc','aaabbb','dddddd'),yname3=c('bbaaaa','aaaaaa','dddddd','eeeeee'),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,ex,y)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df));
## Unit: microseconds
## expr min lq mean median uq max neval
## bgoldst(df) 491.373 544.6165 597.4743 575.8350 609.192 2054.872 100
## markus(df) 375.907 435.5645 463.7258 467.4250 489.022 549.962 100
## alistaire(df) 754.380 816.1755 849.8749 840.3385 888.021 959.654 100
## jota(dt) 4143.955 4425.7785 4741.8354 4656.2835 4854.928 7347.930 100
## mhairi(df) 1938.122 2047.1740 2182.1841 2135.4850 2209.896 3969.045 100
## coffein(df) 1286.380 1430.9265 1546.3245 1511.3255 1562.430 3319.441 100
Run Code Online (Sandbox Code Playgroud)
## scale test
set.seed(1L);
NR <- 4e3L; NC <- 3L; SL <- 6L;
df <- as.data.frame(setNames(nm=paste0('yname',seq_len(NC)),replicate(NC,do.call(paste0,replicate(SL,sample(letters,NR,T),simplify=F)),simplify=F)),stringsAsFactors=F);
dt <- as.data.table(df);
ex <- bgoldst(df);
identical(ex,markus(df));
## [1] TRUE
identical(ex,{ y <- alistaire(df); names(y) <- names(ex); rownames(y) <- NULL; cis <- seq_along(df)+ncol(df); y[cis] <- lapply(y[cis],as.integer); y; });
## [1] TRUE
identical(ex,as.data.frame(jota(dt)));
## [1] TRUE
identical(ex,{ y <- mhairi(df); y <- y[c(cns,names(y)[!names(y)%in%cns])]; names(y) <- names(ex); y <- y[do.call(order,Map(match,y,ex)),]; rownames(y) <- NULL; y; });
## [1] TRUE
identical(ex,{ y <- coffein(df); names(y) <- names(ex); y; });
## [1] TRUE
microbenchmark(bgoldst(df),markus(df),alistaire(df),jota(dt),mhairi(df),coffein(df),times=3L);
## Unit: milliseconds
## expr min lq mean median uq max neval
## bgoldst(df) 85.20766 87.00487 88.39154 88.80209 89.98348 91.16487 3
## markus(df) 3771.08606 3788.97413 3799.08405 3806.86220 3813.08305 3819.30390 3
## alistaire(df) 83.03348 83.10276 83.18116 83.17204 83.25500 83.33797 3
## jota(dt) 12.49174 13.82088 14.44939 15.15002 15.42821 15.70640 3
## mhairi(df) 156.06459 156.36608 158.27256 156.66758 159.37654 162.08551 3
## coffein(df) 154.02853 154.97215 156.52246 155.91576 157.76942 159.62309 3
Run Code Online (Sandbox Code Playgroud)