sop*_*hia 2 r rcpp stringr data.table
I have a large vector (100M elements) of words of the form:
words <- paste(letters,letters,letters,letters,sep="_")
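(In the real data the words are not all identical, but they all have length 8.)
I want to convert them into a data frame with one column per letter of each word and one row per word. I tried str_split_fixed and rbind on the result, but on the large vector R freezes or takes forever.
So the desired output has this form: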
  l1 l2 l3 l4
1  a  a  a  a
2  b  b  b  b
3  c  c  c  c
Is there a faster way?
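You could use paste() to collapse the vector elements into a single string, then use fread() to parse that collapsed string into a data.table/data.frame. As a function: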
collapse2fread <- function(x, sep) {
  require(data.table)
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
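As a quick illustration, the helper can be applied to the small example vector from the question. This is only a sketch: the column names l1..l4 are taken from the desired output in the question, and the renaming step with setnames() is an addition for illustration, not part of the original answer.

```r
library(data.table)

# Same helper as above, repeated so this snippet runs on its own.
collapse2fread <- function(x, sep) {
  fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}

words <- paste(letters, letters, letters, letters, sep = "_")
dt <- collapse2fread(words, sep = "_")

# fread names the columns V1, V2, ... by default;
# rename them to l1..l4 to match the desired output.
setnames(dt, paste0("l", seq_along(dt)))
head(dt, 3)
```

If the real data could contain tokens that fread might interpret as something other than text (for example "NA"), passing colClasses = "character" is probably a sensible guard.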
You could also try implementing the collapse step in C++ via the Rcpp package to squeeze out a bit more speed. Something like:
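#include <Rcpp.h>
using namespace Rcpp;

// Join all elements of `subject` into one string, appending `collapseBy` after each element.
// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
  int n = subject.size();
  std::string collapsed;
  for (int i = 0; i < n; i++) {
    collapsed += std::string(subject[i]) + collapseBy;
  }
  return collapsed;
}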
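The answer does not say how the C++ function was compiled. One way to do it from an R session, assuming a working C++ toolchain, is Rcpp::cppFunction(), which supplies the Rcpp header and namespace itself. The sketch below uses Rcpp::as<std::string>() for the element conversion, which is a small deviation from the code above.

```r
library(Rcpp)

# Compile the collapse helper inline; cppFunction() adds the Rcpp include
# and `using namespace Rcpp;` on its own.
cppFunction('
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
  int n = subject.size();
  std::string collapsed;
  for (int i = 0; i < n; i++) {
    // Append each element followed by the separator (including after the last one).
    collapsed += Rcpp::as<std::string>(subject[i]) + collapseBy;
  }
  return collapsed;
}')

res <- collapse_cpp(c("a_a", "b_b"), "\n")
```

Note that, unlike paste0(..., collapse = "\n"), this version also appends the separator after the last element, so the result ends with a newline.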
Then we get:
collapse_cpp2fread <- function(x, sep) {
  require(data.table)
  # collapse_cpp's second argument is named collapseBy, so pass the separator positionally
  fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}
microbenchmark(
  paste0(words, collapse = "\n"),
  collapse_cpp(words, "\n"),
  times = 100)
Not much, but it's something:
Unit: microseconds
expr                            min    lq      median  uq      max     neval
paste0(words, collapse = "\n")  7.297  7.7695  8.162   8.4255  33.824  100
collapse_cpp(words, "\n")       4.477  5.0095  5.117   5.3525  17.052  100
With a more realistic input:
words <- rep(paste0(letters[1:8], collapse = '_'), 1e5) # 100K elements
Benchmark:
microbenchmark(
  do.call(rbind, strsplit(words, '_')),
  fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE),
  fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE),
  times = 10)
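Before reading too much into the timings, it may be worth confirming that the strsplit and fread approaches produce the same values. The check below is a minimal sketch on a smaller input (1,000 elements rather than 100K); the colClasses = "character" argument is an added assumption to keep every column as text so the comparison is like for like.

```r
library(data.table)

# Smaller version of the benchmark input, to keep the check fast.
words <- rep(paste0(letters[1:8], collapse = "_"), 1e3)

# Baseline: character matrix with one row per word.
m <- do.call(rbind, strsplit(words, "_", fixed = TRUE))

# fread approach, forcing all columns to character.
dt <- fread(paste0(words, collapse = "\n"), sep = "_",
            header = FALSE, colClasses = "character")

# Same shape and same cell values.
same_dim <- identical(dim(as.matrix(dt)), dim(m))
same_values <- all(as.matrix(dt) == m)
```

One difference to keep in mind: the strsplit route yields a character matrix, while fread returns a data.table, so downstream code may need as.data.frame() or as.matrix() depending on what it expects.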
This gives:
Unit: milliseconds
expr                                                              min        lq         median     uq         max       neval
do.call(rbind, strsplit(words, "_"))                              782.71782  796.19154  822.73694  854.22211  863.0790  10
fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE)  62.56164   64.13504   68.22512   71.96075   151.5969  10
fread(collapse_cpp(words, "\n"), sep = "_", header = FALSE)       47.16362   47.78030   50.12867   52.23102   109.9770  10
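So that's roughly a 12x (paste0) to 16x (collapse_cpp) improvement in median time over do.call(rbind, strsplit(...)). Hope this helps!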