How do I vectorize R's strsplit?

Jam*_*mes 15 r vectorization strsplit

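When writing functions that use strsplit, vector inputs do not behave as desired, and sapply has to be used. This is because strsplit produces list output. Is there a way to vectorize the process, that is, so that the function produces the correct element in the list for each element of the input?

For example, to count the lengths of the words in a character vector: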

words <- c("a","quick","brown","fox")

> length(strsplit(words,""))
[1] 4 # The number of words (length of the list)

> length(strsplit(words,"")[[1]])
[1] 1 # The length of the first word only

> sapply(words,function (x) length(strsplit(x,"")[[1]]))
a quick brown   fox 
1     5     5     3 
# Success, but potentially very slow

Ideally, there would be something like length(strsplit(words,"")[[.]]), where . is interpreted as the relevant part of the input vector.

Sha*_*ane 21

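In general, you should try to use a vectorized function to begin with. Using strsplit will frequently require some kind of iteration afterwards (which will be slower), so avoid it where possible. In your example, you should use nchar instead: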

> nchar(words)
[1] 1 5 5 3

More generally, take advantage of the fact that strsplit returns a list and use lapply:

> as.numeric(lapply(strsplit(words,""), length))
[1] 1 5 5 3
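
As a side note that is not part of the original answer: the same per-element count can also be sketched with base R's lengths() (available from R 3.2.0) or with vapply(), which checks that every result is a single integer. The variable names n_lengths and n_vapply are only illustrative.

```r
words <- c("a", "quick", "brown", "fox")

# lengths() returns the length of each element of the list from strsplit()
# (assumes R >= 3.2.0, where lengths() was added to base)
n_lengths <- lengths(strsplit(words, ""))

# vapply() iterates like sapply(), but enforces that each result is one integer
n_vapply <- vapply(strsplit(words, ""), length, integer(1))

print(n_lengths)
print(n_vapply)
```

Both still split every string first, so for plain character counts nchar remains the more direct choice.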

Or else use a function from the l*ply family in plyr. For instance:

> laply(strsplit(words,""), length)
[1] 1 5 5 3

Edit:

In honor of Bloomsday, I decided to test the performance of these approaches using Joyce's Ulysses:

joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
joyce <- unlist(strsplit(joyce, " "))

Now that I have all the words, we can do the counts:

> # original version
> system.time(print(summary(sapply(joyce, function (x) length(strsplit(x,"")[[1]])))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   2.65    0.03    2.73 
> # vectorized function
> system.time(print(summary(nchar(joyce))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
   0.05    0.00    0.04 
> # with lapply
> system.time(print(summary(as.numeric(lapply(strsplit(joyce,""), length)))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
    0.8     0.0     0.8 
> # with laply (from plyr)
> system.time(print(summary(laply(strsplit(joyce,""), length))))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   3.000   4.000   4.666   6.000  69.000 
   user  system elapsed 
  17.20    0.05   17.30
> # with ldply (from plyr)
> system.time(print(summary(ldply(strsplit(joyce,""), length))))
       V1        
 Min.   : 0.000  
 1st Qu.: 3.000  
 Median : 4.000  
 Mean   : 4.666  
 3rd Qu.: 6.000  
 Max.   :69.000  
   user  system elapsed 
   7.97    0.00    8.03 

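The vectorized function (nchar) and lapply are considerably faster than the original sapply version, and both plyr variants are the slowest in this run. All solutions return the same answer (as seen from the summary output).

Apparently the latest version of plyr is faster (this is using a slightly older version).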
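
For anyone who wants to repeat the comparison without downloading the text, here is a rough sketch that uses randomly generated words as a stand-in for the Ulysses vector. The fake_words data, its size, and the variable names are assumptions made for illustration, not part of the original test, and the timings will vary by machine and R version.

```r
set.seed(1)

# Hypothetical stand-in for the word vector: 20,000 random lowercase "words"
# of 1 to 10 letters each
fake_words <- vapply(
  sample(1:10, 2e4, replace = TRUE),
  function(n) paste(sample(letters, n, replace = TRUE), collapse = ""),
  character(1)
)

# Original approach: one strsplit() call per word inside sapply()
t_sapply <- system.time(
  r_sapply <- sapply(fake_words,
                     function(x) length(strsplit(x, "")[[1]]),
                     USE.NAMES = FALSE)
)

# Fully vectorized function
t_nchar <- system.time(r_nchar <- nchar(fake_words))

# One strsplit() call on the whole vector, then lapply() over the list
t_lapply <- system.time(
  r_lapply <- as.numeric(lapply(strsplit(fake_words, ""), length))
)

# Elapsed seconds for each approach
print(c(sapply = t_sapply[["elapsed"]],
        nchar  = t_nchar[["elapsed"]],
        lapply = t_lapply[["elapsed"]]))

# All three approaches should agree on every word
stopifnot(all(r_sapply == r_nchar), all(r_lapply == r_nchar))
```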