R data.table, variable set of columns per row

pau*_*n32 3 r data.table

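For each student in a data set, a particular set of scores may have been collected. We want to calculate a mean for each student, using only the scores in the columns that are relevant to that student.

The columns needed in the calculation differ from row to row. I have already figured out how to write this in R with the usual tools, but I am trying to rewrite it with data.table, partly for fun and partly in anticipation that this small project succeeds, which could mean running the calculation on many, many rows.

Here is a small working example of the "choose a specific set of columns for each row" problem.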

set.seed(123234)
## Suppose these are 10 students in various grades
dat <- data.frame(id = 1:10, grade = rep(3:7, times = 2),
                  A = sample(c(1:5, 9), 10, replace = TRUE),
                  B = sample(c(1:5, 9), 10, replace = TRUE),
                  C = sample(c(1:5, 9), 10, replace = TRUE),
                  D = sample(c(1:5, 9), 10, replace = TRUE))
## 9 is a marker for missing value, there might also be
## NAs in real data, and those are supposed to be regarded
## differently in some exercises

## Students in various grades are administered different
## tests.  A data structure gives the grade to test linkage.
## The letters are column names in dat
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"),
               "6" = c("A", "B", "C", "D"),
               "7" = c("C", "D"),
               "8" = c("C"))

## wrapper around that lookup because I kept getting confused
getLookup <- function(grade){
    lookup[[as.character(grade)]]
}


## Function that receives one row (named vector)
## from data frame and chooses columns and makes calculation
getMean <- function(arow, lookup){
    scores <- arow[getLookup(arow["grade"])]
    mean(scores[scores != 9], na.rm = TRUE)
}

stuscores <- apply(dat, 1, function(x) getMean(x, lookup))

result <- data.frame(dat, stuscores)
result

## If the data has many thousands of rows,
## I will wish I could use data.table to do that.

## Client will want students sorted by state, district, classroom,
## etc.

## However, am stumped on how to specify the adjustable
## column-name chooser

library(data.table)
DT <- data.table(dat)
## How to write call to getMean correctly?
## Want to do this for each participant (no grouping)
setkey(DT, id)

The desired output is the student mean over the appropriate columns, like this:

> result
   id grade A B C D stuscores
1   1     3 9 9 1 4       NaN
2   2     4 5 4 1 5       3.0
3   3     5 1 3 5 9       4.0
4   4     6 5 2 4 5       4.0
5   5     7 9 1 1 3       2.0
6   6     3 3 3 4 3       3.0
7   7     4 9 2 9 2       NaN
8   8     5 3 9 2 9       2.0
9   9     6 2 3 2 5       3.0
10 10     7 3 2 4 1       2.5

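So what have I tried? So far I have mostly written a lot of errors...

I have not found any example among the data.table examples in which the columns used in the calculation for each row are themselves a variable, so I would appreciate your advice.

I am not asking anyone to write the code for me; I am asking for advice on how to get started on this problem.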

Dav*_*urg 6

First, when you create a reproducible example with functions such as sample (which sets a random seed each time it runs), you should use set.seed.
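As a small illustration of that point, the sketch below draws the same kind of sample twice after resetting the seed. The variable names `first` and `second` are made up for this example; only `set.seed` and `sample` come from the discussion above.

```r
## Resetting the seed before each draw makes the draws repeatable
set.seed(123)
first <- sample(c(1:5, 9), 10, replace = TRUE)

set.seed(123)
second <- sample(c(1:5, 9), 10, replace = TRUE)

## Both draws started from the same seed, so they match exactly
identical(first, second)
# [1] TRUE
```

Anyone who runs the example with the same seed and the same R version should then see the same data as the poster.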

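Second, instead of looping over each row, you can loop over the lookup list, which will always be smaller than the data (often many times smaller), and combine that with rowMeans. You could also do this in base R, but you asked for a data.table solution, so here it is (for the purposes of this solution I converted all 9s to NAs, but you can try to generalize this to your specific case).

So with set.seed(123), your function gives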

apply(dat, 1, function(x) getMean(x, lookup))
# [1] 2.000000 5.000000 4.666667 4.500000 2.500000 1.000000 4.000000 2.333333 2.500000 1.500000

Here is a possible data.table approach that runs a for loop only over the lookup list (for loops over lists are very efficient in R, by the way; see here)

## convert 9 to NA in the score columns only
## (applying this to the whole data set would also turn id 9 into NA)
scoreCols <- c("A", "B", "C", "D")
is.na(dat[scoreCols]) <- dat[scoreCols] == 9L
## convert your original data to a `data.table` by reference;
## no additional copy of the data is needed, which matters if the data is huge
setDT(dat)
## loop only over the list
for(i in names(lookup)) {
  dat[grade == i, res := rowMeans(as.matrix(.SD[, lookup[[i]], with = FALSE]), na.rm = TRUE)]
}
dat
#     id grade  A  B  C  D      res
#  1:  1     3  2 NA NA NA 2.000000
#  2:  2     4  5  3  5 NA 5.000000
#  3:  3     5  3  5  4  5 4.666667
#  4:  4     6 NA  4 NA  5 4.500000
#  5:  5     7 NA  1  4  1 2.500000
#  6:  6     3  1 NA  5  3 1.000000
#  7:  7     4  4  2  4  5 4.000000
#  8:  8     5 NA  1  4  2 2.333333
#  9:  9     6  4  2  2  2 2.500000
# 10: 10     7  3 NA  1  2 1.500000
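The same loop-over-the-lookup idea also works in base R, as mentioned earlier. The sketch below is only an illustration: the data values are hand-made (not the question's random data), the lookup is shortened to three grades, and the 9s are masked in a temporary matrix rather than overwritten, on the assumption that the original codes should stay intact because real NAs may need different treatment.

```r
## Hypothetical toy data; 9 is the missing-value code
dat <- data.frame(id    = 1:6,
                  grade = c(3L, 4L, 5L, 3L, 4L, 5L),
                  A = c(2, 5, 3, 9, 4, 1),
                  B = c(4, 3, 5, 9, 2, 1),
                  C = c(1, 5, 4, 2, 9, 4),
                  D = c(3, 9, 5, 1, 5, 2))
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"),
               "5" = c("B", "C", "D"))

res <- rep(NA_real_, nrow(dat))
for (g in names(lookup)) {
  rows <- which(dat$grade == as.integer(g))
  if (length(rows) == 0L) next            # no students in this grade
  ## drop = FALSE keeps a data frame even for a single row or column
  m <- as.matrix(dat[rows, lookup[[g]], drop = FALSE])
  m[which(m == 9)] <- NA                  # mask 9 in the copy only
  res[rows] <- rowMeans(m, na.rm = TRUE)
}
dat$res <- res
dat$res
# [1] 3.000000 5.000000 4.666667      NaN 4.000000 2.333333
```

The loop runs once per grade rather than once per student, so its cost is dominated by the vectorized rowMeans calls. Row 4 comes out as NaN because both of its relevant scores carry the missing-value code, matching the behaviour of getMean in the question.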

This could probably be improved by using set, but I can't think of a good way to do it at the moment.
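One way a set-based version might look is sketched below. This is an untested-for-speed assumption, not the answerer's method: it uses hand-made toy data with the 9s already recoded to NA, creates the result column up front, and then fills it per grade with data.table's `set(x, i, j, value)`. Whether it is actually faster than the `:=` loop would need benchmarking on real data.

```r
library(data.table)

## Hypothetical toy data, with 9 already recoded to NA
DT <- data.table(id    = 1:4,
                 grade = c(3L, 4L, 3L, 4L),
                 A = c(2, 5, NA, 4),
                 B = c(4, 3, 1, 2),
                 C = c(1, 5, 2, NA))
lookup <- list("3" = c("A", "B"),
               "4" = c("A", "C"))

DT[, res := NA_real_]                      # create the result column once
for (g in names(lookup)) {
  rows <- which(DT$grade == as.integer(g))
  if (length(rows) == 0L) next             # no students in this grade
  vals <- rowMeans(as.matrix(DT[rows, lookup[[g]], with = FALSE]),
                   na.rm = TRUE)
  set(DT, i = rows, j = "res", value = vals)  # assign by reference
}
DT$res
# [1] 3 5 1 4
```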


P.S.

As @Arun suggested, please take a look at the vignettes he wrote himself here to get familiar with the := operator, .SD, with = FALSE, etc.