将dplyr :: mutate与R中的lubridate :: ymd_hms组合在一起会导致段错误

Joh*_*all 7 r segmentation-fault lubridate dplyr

我尽可能地搜索与此相关的内容,但在SO或dplyr github上找不到任何内容; 可能是一个新问题,因为下面的代码在今天之前运行良好?

问题在概念上很简单:my_data %>% mutate(x = ymd_hms(x))有时调用,但不总是(即随机调用)导致R因捕获的段错误而崩溃.我已经将问题简化为最简单的形式(也在这里:https://gist.github.com/john-sandall/05c3abb24fc738ddc2ad):

require(lubridate)
require(dplyr)

set.seed(42)
make_some_random_datetimes = function(n) ymd("2015-01-01") + seconds(runif(n, min=0, max=60*60*24*365))

d = data.frame(
  col1 = make_some_random_datetimes(5000),
  col2 = make_some_random_datetimes(5000)
)

do_it = function() {
  d %>% mutate(
    col1 = ymd_hms(col1),
    col2 = ymd_hms(col2)  # for some reason, it only crashes when evaluating 2+ cols, if we removed this line it'd be fine
  )
  return(TRUE)
}

do_it()  # doesn't crash every time...it fails every nth time where n is randomly distributed with mean of roughly 7.7

do_it_lots_of_times = function(n) for (i in 1:n) do_it()

do_it_lots_of_times(50)  # almost guaranteed to fail on my machine
Run Code Online (Sandbox Code Playgroud)

所以在某些时候,运行do_it()上面会导致段错误,在终端中运行R的输出是

*** caught segfault ***
address 0x0, cause 'unknown'
Run Code Online (Sandbox Code Playgroud)

我今天早上升级到R版本3.2.1,虽然回滚到3.2.0并重新安装库没有帮助.然后我尝试卸载/重新安装R(使用brew install r完全更新/升级的自制程序),然后重新安装上面所有必需的包.这是sessionInfo()的输出:

R version 3.2.1 (2015-06-18)
Platform: x86_64-apple-darwin14.3.0 (64-bit)
Running under: OS X 10.10.3 (Yosemite)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.4.2     lubridate_1.3.3

loaded via a namespace (and not attached):
[1] lazyeval_0.1.10 R6_2.0.1        assertthat_0.1  magrittr_1.5    plyr_1.8.3      parallel_3.2.1 
[7] DBI_0.3.1       tools_3.2.1     memoise_0.2.1   Rcpp_0.11.6     stringi_0.5-2   digest_0.6.8   
[13] stringr_1.0.0  
Run Code Online (Sandbox Code Playgroud)

作为一名统计学家并且没有想法,我决定查看失败率的分布情况,看看这是否有助于解决问题.如果do_it()在第n次运行时遇到崩溃,我写下了n在50次崩溃的情况下(例如第3次尝试,然后是第7次尝试),我得到了这个序列:

3, 7, 9, 20, 9, 9, 9, 7, 4, 23, 6, 3, 3, 3, 7, 7, 3, 9, 6, 6, 7, 10, 13, 7, 3, 7, 4, 7, 9, 6, 7, 7, 6, 6, 7, 7, 7, 9, 6, 12, 7, 7, 5, 9, 18, 6, 7, 9, 9, 7
Run Code Online (Sandbox Code Playgroud)

这给了我这个分布:

我不知道这是否相关或有帮助,虽然我注意到的另一件事是将数据帧中的行数d从5000增加到10000似乎将n的平均值从~8增加到~20.

任何有关这方面的帮助都将非常受欢迎!

Joh*_*all 6

90%肯定这是最新版dplyr(0.4.2)中的错误,请参阅此处的问题:https://github.com/hadley/dplyr/issues/1231

将我的dplyr版本降级到0.4.1,如下所示可以解决问题:

packageurl = "http://cran.r-project.org/src/contrib/Archive/dplyr/dplyr_0.4.1.tar.gz"
install.packages(packageurl, repos=NULL, type="source", dependencies = TRUE)
Run Code Online (Sandbox Code Playgroud)