Creating a function to replace values in one data.frame with values from another data.frame

JD *_*ong 17 r na

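I often run into situations where I need to replace missing values in a data.frame with values from another data.frame that is at a different level of aggregation. For example, if I have a data.frame full of county data, I might replace the NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times, I decided to break down and write a function to do this.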

Here's what I cooked up, along with an example of how I use it:

fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
 # inner join: shared non-key columns get the suffixes ".x" (naDf) and ".y" (fillDf)
 mergedDf <- merge(naDf, fillDf, by=mergeCols)
 for (col in fillCols){
   colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
   colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
   # positions of the NAs to be replaced
   k <- which( is.na( colWithNas ) )
   colWithNas[k] <- colWithOutNas[k]
   mergedDf[col] <- colWithNas
   # drop the suffixed helper columns
   mergedDf[[paste(col, "x", sep=".")]] <- NULL
   mergedDf[[paste(col, "y", sep=".")]] <- NULL
 }
 return(mergedDf)
}

## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
## g has 200 values, so data.frame() recycles a, b and f to give 200 rows
naDf <- data.frame( a = sample(c(1,2), 100, replace=TRUE), b = sample(c(3,4), 100, replace=TRUE), f = sample(c(0,NA), 100, replace=TRUE), g = sample(c(0,NA), 200, replace=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )

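After I got this running, I had the odd feeling that someone had probably solved this problem before me, and in a more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way to eliminate the loop in the middle of the function? The loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes that the columns we fill from have the same names as the columns we are filling, and the same goes for the merge columns.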

Any guidance or refactoring would be helpful.

EDIT on Dec 2: I realized I had logic flaws in my example, which I have fixed.

Jos*_*ien 14

What a great question.

Here's a data.table solution:

# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))


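# Merge data.tables, based on their keys (columns a & b)
# Note: the output below comes from the data.table version in use at the time.
# Newer versions name the fillDT columns i.f and i.g instead of f.1 and g.1,
# so adjust the names used further down accordingly.
outDT <- naDT[fillDT]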
#      a b  f  g f.1 g.1
# [1,] 1 3 NA  0 100  11
# [2,] 1 3 NA NA 100  11
# [3,] 1 3 NA  0 100  11
# [4,] 1 3  0  0 100  11
# [5,] 1 3  0 NA 100  11
# First 5 rows of 200 printed.

# In outDT[i, j], on the following two lines
#   -- i is a logical vector indicating which rows will be operated on
#   -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
#        left column (e.g. f)"
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]

# Just keep the four columns ultimately needed   
outDT <- outDT[,list(a,b,g,f)]
#       a b  g   f
#  [1,] 1 3  0   0
#  [2,] 1 3 11   0
#  [3,] 1 3  0   0
#  [4,] 1 3 11   0
#  [5,] 1 3 11   0
# First 5 rows of 200 printed.
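For what it's worth, the same fill can probably be written as an "update join" in more recent data.table versions. The following is only a sketch under a few assumptions: a data.table version that supports `on=` and `fifelse()`, the newer `i.` prefix for columns coming from the fill table, and small made-up data frames (not the random test case from the question) so the result is easy to check by eye.

```r
library(data.table)

# Small illustrative data (hypothetical, not the question's random test case)
naDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                   f = c(NA, 0, NA, 0), g = c(0, NA, NA, 0))
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     f = c(100, 200, 300, 400), g = c(11, 12, 13, 14))

# as.data.table() copies, so the original data.frames are left untouched
outDT <- as.data.table(naDf)
fillDT <- as.data.table(fillDf)

# Update join: for rows of outDT that match fillDT on a and b, replace NAs
# in f and g with the fill table's values (referred to as i.f and i.g).
# outDT is modified by reference; no extra columns are created.
outDT[fillDT, on = c("a", "b"),
      `:=`(f = fifelse(is.na(f), i.f, f),
           g = fifelse(is.na(g), i.g, g))]

print(outDT)
```

Unlike the inner join above, rows of the NA table that have no match in the fill table stay in the result (with their NAs), and the row order is unchanged.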


Jos*_*ich 6

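Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.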

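This function assumes that any column not in mergeCols is fair game to have its NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.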

fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
  mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
  dataCols <- setdiff(names(naDf),mergeCols)
  # loop over all columns we didn't merge by
  for(col in dataCols) {
    rows <- is.na(mergedDf[,col])
    # skip this column if it doesn't contain any NAs
    if(!any(rows)) next
    rows <- which(rows)
    # replace NAs with values from fillDf
    mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
  }
  # don't return ".fill" columns
  mergedDf[,names(naDf)]
}
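For comparison, here is roughly what the lapply variant mentioned above might look like. It is a sketch, not a drop-in replacement: the name `fillNaDfLapply` and the small data frames are made up for illustration, and it assumes every column to be filled also exists in fillDf (columns that don't are simply left alone).

```r
# Hypothetical lapply-based variant of fillNaDf.ju
fillNaDfLapply <- function(naDf, fillDf, mergeCols) {
  mergedDf <- merge(fillDf, naDf, by = mergeCols, suffixes = c(".fill", ""))
  # only fill columns that are present in both data.frames
  dataCols <- intersect(setdiff(names(naDf), mergeCols), names(fillDf))
  # build each filled column and assign them all back in one step
  mergedDf[dataCols] <- lapply(dataCols, function(col) {
    values <- mergedDf[[col]]
    fill <- mergedDf[[paste(col, "fill", sep = ".")]]
    ifelse(is.na(values), fill, values)
  })
  # don't return ".fill" columns
  mergedDf[, names(naDf)]
}

# Small illustrative data (hypothetical)
naDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                   f = c(NA, 0, NA, 0), g = c(0, NA, NA, 0))
fillDf <- data.frame(a = c(1, 2, 1, 2), b = c(3, 3, 4, 4),
                     f = c(100, 200, 300, 400), g = c(11, 12, 13, 14))

out <- fillNaDfLapply(naDf, fillDf, mergeCols = c("a", "b"))
print(out)
```

The loop disappears, but the work per column is the same, so this is mostly a matter of taste. Note that merge() sorts the result by the merge columns, so the rows may come back in a different order than in naDf.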