Large character vector to data.frame

sop*_*hia 2 r rcpp stringr data.table

I have a large vector (100M elements) of words of the form:

words <- paste(letters,letters,letters,letters,sep="_")

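(The words in the actual data are not all the same, but all are of length 8.)

I want to convert them to a data frame with one column per letter and one row per word. For this I tried str_split_fixed and rbind on the result, but on the large vector R freezes or takes forever.

The desired output has this form: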

      l1    l2    l3    l4
1     a     a     a     a  
2     b     b     b     b
3     c     c     c     c

Is there a faster way to do this?

npj*_*pjc 7

Solution:

  • Use paste() to collapse the vector elements into a single string
  • Use fread() to parse the collapsed string into a data.table/data.frame

As a function:

# Collapse x into one newline-separated string and let fread() split it on sep
collapse2fread <- function(x, sep) {

    require(data.table)
    fread(paste0(x, collapse = "\n"), sep = sep, header = FALSE)
}
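As a small usage sketch, the two steps above can be run on the 26-word example from the question. The `setnames()` call and the `l1`..`l4` names are only there to mirror the desired output; fread() itself names the columns `V1`..`V4` when `header = FALSE`. This assumes the data.table package is installed.

```r
library(data.table)

# Small stand-in for the real 100M-element vector
words <- paste(letters, letters, letters, letters, sep = "_")

# Collapse to one newline-separated string, then let fread() split on "_"
dt <- fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE)

# Rename the default V1..V4 columns to the names used in the question
setnames(dt, paste0("l", seq_along(dt)))

head(dt, 3)
```

If the real pieces could look like numbers or logicals, passing `colClasses = "character"` to fread() should keep every column as text.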

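Rcpp on top of that?

You could also try implementing the collapse step in C++ via the Rcpp package to squeeze out a bit more speed. Something like: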

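#include <Rcpp.h>
using namespace Rcpp;

// Concatenate all elements of `subject`, appending `collapseBy` after each one
// [[Rcpp::export]]
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {

    int n = subject.size();
    std::string collapsed;

    for (int i = 0; i < n; i++) {
        collapsed += std::string(subject[i]) + collapseBy;
    }
    return collapsed;
}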
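One way to try the C++ function from R, assuming Rcpp and a working C++ toolchain are installed, is `Rcpp::cppFunction()`, which supplies the Rcpp header and the export attribute itself. The sketch below also shows one behavioural difference from `paste0(..., collapse = )`: the loop appends the separator after every element, including the last, so the result ends with a trailing separator. For a newline separator this should be harmless to fread(), since text files normally end with a newline anyway.

```r
library(Rcpp)

# Compile the function inline; cppFunction() adds the include and export itself
cppFunction('
std::string collapse_cpp(CharacterVector subject, const std::string collapseBy) {
    int n = subject.size();
    std::string collapsed;
    for (int i = 0; i < n; i++) {
        collapsed += std::string(subject[i]) + collapseBy;
    }
    return collapsed;
}')

words <- c("a_a_a_a", "b_b_b_b")

cpp_out   <- collapse_cpp(words, "\n")       # ends with a trailing "\n"
paste_out <- paste0(words, collapse = "\n")  # no trailing "\n"

# The two results differ only by that final separator
identical(cpp_out, paste0(paste_out, "\n"))
```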

Then we get:

# Same as collapse2fread(), but collapsing with the C++ helper
collapse_cpp2fread <- function(x, sep) {

    require(data.table)
    fread(collapse_cpp(x, "\n"), sep = sep, header = FALSE)
}

Quick test of the C++ function:

library(microbenchmark)

microbenchmark(
    paste0(words, collapse = "\n"),
    collapse_cpp(words, "\n"),
    times = 100)

Not much, but something:

> Unit: microseconds
>                             expr   min     lq median     uq    max neval
>  paste0(words, collapse = "\\n") 7.297 7.7695  8.162 8.4255 33.824   100
>       collapse_cpp(words, "\\n") 4.477 5.0095  5.117 5.3525 17.052   100

Comparison with the strsplit approach:

Make a more realistic input:

words <- rep(paste0(letters[1:8], collapse = '_'), 1e5) # 100K elements

Benchmark:

microbenchmark(
    do.call(rbind, strsplit(words, '_')),
    fread(paste0(words,collapse="\n"),sep="_",header=FALSE),
    fread(collapse_cpp(words,"\n"),sep="_",header=FALSE),
    times=10)

Gives:

> Unit: milliseconds
>                                                              expr       min        lq    median        uq      max neval
>                              do.call(rbind, strsplit(words, "_")) 782.71782 796.19154 822.73694 854.22211 863.0790    10
> fread(paste0(words, collapse = "\\n"), sep = "_", header = FALSE)  62.56164  64.13504  68.22512  71.96075 151.5969    10
>      fread(collapse_cpp(words, "\\n"), sep = "_", header = FALSE)  47.16362  47.78030  50.12867  52.23102 109.9770    10
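Before relying on the timings, it may be worth checking that the approaches return the same values. The sketch below compares the strsplit baseline with the fread approach on a small, hypothetical 5-word input. Note that the baseline returns a character matrix while fread() returns a data.table, so only the cell values are compared. `colClasses = "character"` is an addition here, not part of the benchmarked calls, to keep fread() from guessing other column types.

```r
library(data.table)

# Small hypothetical input in the same shape as the benchmark data
words <- rep(paste0(letters[1:8], collapse = "_"), 5)

# Baseline: character matrix, one row per word
m <- do.call(rbind, strsplit(words, "_", fixed = TRUE))

# fread approach, forcing every column to stay character
dt <- fread(paste0(words, collapse = "\n"), sep = "_", header = FALSE,
            colClasses = "character")

# Compare cell values only (the matrix has no column names, dt has V1..V8)
all(as.matrix(dt) == m)
```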

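So, going by the median timings, roughly a 12x (paste0) to 16x (collapse_cpp) improvement over strsplit? Hope that helps!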