What is the fastest way to obtain the frequencies of integers in a vector?

Mus*_*ful 6 r frequency histogram ecdf

Is there a simple, fast way to obtain the frequency of each integer that occurs in a vector of integers in R?

Here are my attempts so far:

x <- floor(runif(1000000)*1000)

print('*** using TABLE:')
system.time(as.data.frame(table(x)))

print('*** using HIST:')
system.time(hist(x,breaks=min(x):(max(x)+1),plot=FALSE,right=FALSE))

print('*** using SORT')
system.time({cdf<-cbind(sort(x),seq_along(x)); cdf<-cdf[!duplicated(cdf[,1]),2]; c(cdf[-1],length(x)+1)-cdf})

print('*** using ECDF')
system.time({i<-min(x):max(x); cdf<-ecdf(x)(i)*length(x); cdf-c(0,cdf[-length(i)])})

print('*** counting in loop')
system.time({h<-rep(0,max(x)+1);for(i in seq_along(x)){h[x[i]+1]<-h[x[i]+1]+1}; h}) # +1 because x contains 0 and R indexes from 1

#print('*** vectorized summation') #This uses too much memory if x is large
#system.time(colSums(matrix(rbind(min(x):max(x))[rep(1,length(x)),]==x,ncol=max(x)-min(x)+1)))

#Note: There are some fail cases in some of the above methods that need patching if, for example, there is a chance that some integer bins are unoccupied
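The note above about unoccupied bins can be made concrete with a small sketch. It uses a toy vector (an assumption, not data from the question) to show that table() skips values that never occur unless the levels are spelled out, while hist() with unit-width breaks reports them as zero.

```r
# Toy input, chosen for illustration: 1, 3 and 4 never occur.
x <- c(0L, 2L, 2L, 5L)
bins <- min(x):max(x)

# table() alone only reports values that are present ...
present_only <- as.vector(table(x))

# ... so pass an explicit factor to get one count per integer bin.
counts_table <- as.vector(table(factor(x, levels = bins)))

# hist() with unit-width breaks already includes the empty bins.
counts_hist <- hist(x, breaks = min(x):(max(x) + 1),
                    plot = FALSE, right = FALSE)$counts

stopifnot(all(counts_table == counts_hist))
```

Whichever method is used, it helps to decide up front whether empty bins should appear as zeros, so that results from different methods line up position by position.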

Here are the results:

[1] "*** using TABLE:"
   user  system elapsed 
   1.26    0.03    1.29 
[1] "*** using HIST:"
   user  system elapsed 
   0.11    0.00    0.10 
[1] "*** using SORT"
   user  system elapsed 
   0.22    0.02    0.23 
[1] "*** using ECDF"
   user  system elapsed 
   0.17    0.00    0.17 
[1] "*** counting in loop"
   user  system elapsed 
   3.12    0.00    3.12 

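As you can see, table is remarkably slow and hist seems to be the fastest. But hist (as I am using it) works on arbitrarily specifiable breakpoints, whereas I just want to bin integers. Isn't there a way to trade that flexibility for better performance?

In C, for(i=0;i<1000000;i++)h[x[i]]++; would be blisteringly fast.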

Jos*_*ich 7

The fastest is to use tabulate, but it requires positive integers as input, so you have to do a quick monotonic transformation.

set.seed(21)
x <- as.integer(runif(1e6)*1000)
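system.time({
  adj <- 1L - min(x)   # shift so the smallest value becomes 1
  # tabulate() returns one count per integer from min(x) to max(x),
  # so name the bins by that full range (also correct when some values are absent)
  y <- setNames(tabulate(x+adj), min(x):max(x))
})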
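As a minimal sketch, the same shift-then-tabulate idea can be wrapped in a small function. The name count_ints is hypothetical (it is not part of the answer or of base R), and the sketch assumes a non-empty input of whole numbers with no NA values.

```r
# Hypothetical helper (not from the answer): count every integer between
# min(x) and max(x), including values that never occur.
count_ints <- function(x) {
  stopifnot(length(x) > 0, !anyNA(x))
  x <- as.integer(x)
  lo <- min(x)
  # Shift so the smallest value lands in bin 1, as tabulate() expects.
  counts <- tabulate(x - lo + 1L, nbins = max(x) - lo + 1L)
  setNames(counts, lo:max(x))
}

# Negative values and gaps are both handled:
res <- count_ints(c(-2L, 0L, 0L, 3L))
# counts for -2, -1, 0, 1, 2, 3 are 1, 0, 2, 0, 0, 1
```

Naming the bins by the full range rather than by the observed values keeps the names aligned with the counts even when some integers are missing.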


Joe*_*Joe 5

Don't forget that you can inline C++ code in R.

library(inline)

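# Counts each integer from min(a) to max(a); assumes a is non-empty
# and holds whole numbers (the original indexing wrote out of bounds for 0).
src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
int lo = min(xa);                      // smallest value maps to bin 0
int hi = max(xa);
Rcpp::NumericVector xab(hi - lo + 1);  // one zero-initialised bin per integer
for (int i = 0; i < n_xa; i++)
  xab[(int)xa[i] - lo]++;
return xab;
'
fun <- cxxfunction(signature(a = "numeric"), src, plugin = "Rcpp")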