How to resolve an integer overflow error when estimating a model in R

Jam*_*mes 6 estimation r computation speedglm

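I am trying to estimate a model using speedglm in R. The dataset is large (about 69.88 million rows and 38 columns). Multiplying the number of rows by the number of columns gives about 2.7 billion, which exceeds R's integer limit of 2,147,483,647. I cannot share the data, but the following example reproduces the problem.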

library(speedglm)

# large example that works 
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))   
working.example <- speedglm(fo, data = da, family = Gamma(log))

# repeat with large enough size to break 
k <- 5000       # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))   
failed.example <- speedglm(fo, data = da, family = Gamma(log))

# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)

This produces an error and an integer overflow warning.

Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n,  :  
  missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow 

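I understand the warning, but I do not understand the error. They appear to be related in this case, because they show up together after every attempt.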

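Dropping columns allows the estimation to complete. It does not seem to matter which columns are dropped; removing either interaction or non-interaction variables lets the estimation finish. The chunksize option was added after the error first appeared, but it did not help.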

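My questions are: (1) What causes the error? (2) Is there a way to estimate the model on data where the number of rows multiplied by the number of columns exceeds the integer limit? (3) Is there a better na.action to use in this situation?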

Thanks,

J.P.

Running: R version 3.3.3 (2017-03-06)

The actual code is as follows:

dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0", 
             "ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650", 
             "FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex", 
             "Row", "Modular", "Rural", "FirstTimeBuyer", 
             "FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio", 
             "VintageBefore2001", "NFLD", "yoy.fcpwti:province_n") 
logit1 <- speedglm(formula = paste("DefaultFlag ~ ", 
                                   paste(dft_var, collapse = "+"), 
                                   sep = ""), 
                   family = binomial(logit), 
                   na.action = na.exclude, 
                   data = default.data,
                   chunksize = 1*10^7)

And*_*lin 5

Update:

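Following my investigation below, @James found that the problem can be avoided by supplying a non-NULL value for the sparse argument in the speedglm call, because this prevents the function from calling is.sparse internally.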

Using the example above, the following should now work:

speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
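To make the mechanism concrete, here is a minimal sketch of the guard pattern described above. It assumes the sparsity check runs only when sparse is left as NULL; toy_sparsity_check and toy_fit are hypothetical stand-ins written for illustration, not functions from the speedglm package, and the stand-in check reproduces only the overflowing product, not the real sampling logic.

```r
# Hypothetical stand-in for the sparsity check: it only reproduces the
# integer product of the dimensions, which overflows for large inputs.
toy_sparsity_check <- function(n_rows, n_cols) {
  n_rows * n_cols  # integer * integer: NA (with a warning) on overflow
}

# Hypothetical stand-in for the fitting function: the check is reached
# only when `sparse` is left at its NULL default.
toy_fit <- function(n_rows, n_cols, sparse = NULL) {
  if (is.null(sparse)) {
    sparse <- toy_sparsity_check(n_rows, n_cols)
  }
  sparse
}

# Dimensions of the failing example: 500000 rows, 5000 columns.
default_result  <- suppressWarnings(toy_fit(500000L, 5000L))
explicit_result <- toy_fit(500000L, 5000L, sparse = FALSE)

is.na(default_result)  # the overflowing check ran
explicit_result        # the check was skipped entirely
```

With an explicit sparse value the overflowing product is never computed, which is why the workaround does not depend on the size of the data.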

My original answer:

Both the warning and the error come from the same line of the is.sparse function in the speedglm package.

That line is:

sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)

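The warning occurs because nrow(X)*ncol(X) is evaluated on a large matrix. The nrow and ncol functions return integer values, so their product can overflow. Here is an illustration.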

nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
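The same arithmetic can be checked for the dimensions given in the question. This is a small sketch, assuming the rounded figure of 69.88 million rows (the exact row count is not stated); the variable names are illustrative only. It also shows that converting one operand to double before multiplying avoids the overflow.

```r
# Approximate dimensions from the question (the row count is rounded).
n_rows <- 69880000L
n_cols <- 38L

# Integer arithmetic returns NA (with a warning) past .Machine$integer.max.
int_product <- suppressWarnings(n_rows * n_cols)

# Converting one operand to double first keeps the product exact.
dbl_product <- as.numeric(n_rows) * n_cols

.Machine$integer.max                 # 2147483647
is.na(int_product)                   # TRUE
dbl_product > .Machine$integer.max   # TRUE
```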

The error occurs because the sample function breaks when X is a large matrix and size = NA. Here is an illustration:

sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n,  : 
# missing value where TRUE/FALSE needed
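The failing condition can also be reproduced without allocating a large matrix. The sketch below copies the if () condition quoted in the error message from the question and evaluates it by hand; the values assigned to n, size, replace and prob are assumptions chosen to mimic the failing call, not values read from inside sample.

```r
# Values chosen to mimic the failing call: many elements, size = NA.
replace <- FALSE
prob    <- NULL
n       <- 2.5e9        # number of elements in the large matrix
size    <- NA_integer_  # what the overflowed product passes on

# The condition from the error message evaluates to NA, not TRUE or FALSE.
condition <- !replace && is.null(prob) && n > 1e+07 && size <= n / 2

# if () cannot branch on NA, so it raises the reported error.
result <- tryCatch(
  if (condition) "hash-based sampling" else "regular sampling",
  error = function(e) e
)
inherits(result, "error")
conditionMessage(result)

# With a small n, `n > 1e+07` is FALSE and && short-circuits, so the
# condition is FALSE rather than NA and this particular error does not occur.
n_small <- 10
small_condition <- !replace && is.null(prob) && n_small > 1e+07 && size <= n_small / 2
```

This is consistent with the observation that the error appears only together with the overflow warning on large inputs.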