Applying a function in R causes a memory allocation error

Question

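I am trying to perform a calculation on each row of a data frame in R and add the result to the frame as a new column. I started with the "by" function, but it was very slow, so I switched to the "apply" function. The way I imagine it working is: run apply with my function, save the output to a variable, and append that data to the original data frame.

I wrote a function that calculates the term length of an insurance plan and returns that value, and it works fine on a sample data set. When I use my larger data set, I get the error "cannot allocate vector of size ...". I know many people recommend getting more RAM, but I already have 16GB of memory, and with the entire data set loaded into R my computer says it is only using 7.7GB. The data set has 44 columns and about 11 million records, so I don't see how adding one more column of data could take up 8GB of memory.

Any pointer in the right direction would be great.

Here is the function I am using: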

get_term_length <- function(row_data){

    # convert values to dates
    expiration_date <- as.Date( row_data[42] )
    start_date <- as.Date( row_data[43] )
    cancellation_date <- as.Date( row_data[44] )

    # check to see if the cancellation date is NA - just use entire policy length
    if( is.na(cancellation_date) ){
        return( expiration_date - start_date )
    }

    # check to see if policy was cancelled early
    if(cancellation_date < expiration_date){
        return( cancellation_date - start_date )
    }

    # the policy was for the entire term
    else{
        return( expiration_date - start_date )
    }

}


I have been running the function by calling:

tmp <- apply(policy_data, 1, get_term_length)


Answer 1

mne*_*nel 5

A data.table solution, as hinted at by @Dwin:

 library(data.table)
 policy_data <- as.data.table(policy_data)

  # set the date columns to be IDate (the exact form of this will depend
  # on the format they are currently in)

  policy_data[, cancellation_date := as.IDate(cancellation_date)]
  policy_data[, start_date := as.IDate(start_date)]
  policy_data[, expiration_date := as.IDate(expiration_date)]
  # create a column which is an indicator for NA 

  policy_data[, isna := is.na(cancellation_date)]


  setkey(policy_data, isna)

  policy_data[J(TRUE), tmp := expiration_date - start_date]
  policy_data[J(FALSE), tmp := pmin(cancellation_date - start_date, expiration_date-start_date)]
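
As background on the error: apply() converts a data frame to a matrix with as.matrix() before looping over it, and for a frame with mixed column types that matrix is a character matrix holding all 44 columns. That copy, rather than the one new column, is the likely source of the large allocation. The same calculation can also be done on whole columns in base R without any row-wise loop. The following is only a minimal sketch on toy data; the column names (start_date, expiration_date, cancellation_date) and the result column term_length are assumptions, since the question only refers to the columns by position.

```r
# Toy stand-in for policy_data; the column names are assumed, not taken from the question.
policy_data <- data.frame(
  start_date        = as.Date(c("2012-01-01", "2012-01-01", "2012-01-01")),
  expiration_date   = as.Date(c("2013-01-01", "2013-01-01", "2013-01-01")),
  cancellation_date = as.Date(c(NA, "2012-07-01", "2013-03-01"))
)

# TRUE where a cancellation date exists and falls before the expiration date.
cancelled_early <- !is.na(policy_data$cancellation_date) &
  policy_data$cancellation_date < policy_data$expiration_date

# Effective end of each policy: the expiration date, unless it was cancelled early.
end_date <- policy_data$expiration_date
end_date[cancelled_early] <- policy_data$cancellation_date[cancelled_early]

# Term length in whole days, computed for all rows at once.
policy_data$term_length <- as.integer(
  difftime(end_date, policy_data$start_date, units = "days")
)
```

Because every step operates on entire columns, only a few vectors of length nrow(policy_data) are allocated, and the data frame is never converted to a matrix.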

