sha*_*ala 1 r mean dplyr data.table
假设下表
Name Gender Place Age V1
Tom M NY 24 A
Nadia F AT 22 A
Alex M DE 42 B
Jodie F OH 18 B
Tom M NY 28 B
Alex F ID 32 B
Nadia F AT 34 A
Tom M OH 18 A
Run Code Online (Sandbox Code Playgroud)
我想按名称和性别对表进行分组,使用连接列的多数投票替换地点和V1,使用数字均值替换年龄.结果应该是:
Name Gender Place Age V1
Tom M NY 23.3334 A
Nadia F AT 28 A
Alex M DE 42 B
Jodie F OH 18 B
Alex F ID 32 B
Run Code Online (Sandbox Code Playgroud)
Tom(M)有三个条目,其中NY为两次,OH为一次.按照多数票,NJ更经常被选中.与V1中的A相同.年龄(24,28和18)的平均值是23.3334.
我使用dplyr得到了数值均值:
dt <- dt %>%
group_by_(.dots=lapply(names(dt)[c(1, 2)], as.symbol)) %>%
summarise_each(funs(mean))
Run Code Online (Sandbox Code Playgroud)
并且可以在地点和V1分别进行多数投票:
dt$place<- dt[, names(which.max(table(place))), by = paste(name, gender)]
dt$V1 <- dt[, names(which.max(table(V1))), by = paste(name, gender)]
Run Code Online (Sandbox Code Playgroud)
我的问题是性能.我有一个非常大的数据集,这些修改在多个步骤中花费的时间太长.至少使用某种应用函数来一步完成多数投票会很棒.最好的方法是将多数投票添加到dplyr函数中.
我们创建了一个vector分组列名('grpCol'),用于setdiff获取其余的列名('nm1').循环(sapply)通过'nm1'列来检查这些列中的哪一列是'numeric'(is.numeric)以返回逻辑索引('indx').
grpCol <- c('Name', 'Gender')
nm1 <- setdiff(names(df1), grpCol)
indx <- sapply(df1[nm1], is.numeric)
Run Code Online (Sandbox Code Playgroud)
我们还创建了一个Mode函数来返回最大频率的元素.
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Run Code Online (Sandbox Code Playgroud)
将'data.frame'转换为'data.table'(setDT(df1)),按'grpCol'分组,我们使用'indx'循环遍历Data.table(.SD)子集的子集以返回mean数字列和mode非数据列数字列,concatenate(c)以获得预期的输出.
setDT(df1)[,c(lapply(.SD[, names(indx)[indx], with=FALSE], mean),
lapply(.SD[, names(indx)[!indx], with=FALSE], Mode)) ,
by = grpCol]
# Name Gender Age Place V1
#1: Tom M 23.33333 NY A
#2: Nadia F 28.00000 AT A
#3: Alex M 42.00000 DE B
#4: Jodie F 18.00000 OH B
#5: Alex F 32.00000 ID B
Run Code Online (Sandbox Code Playgroud)
或者正如评论中提到的@Frank,我们可以if/else在内lapply而不是创建'indx'.
setDT(df1)[, lapply(.SD, function(x) {if(is.numeric(x)) mean(x)
else Mode(x)} ), by=.(Name,Gender)]
# Name Gender Place Age V1
#1: Tom M NY 23.33333 A
#2: Nadia F AT 28.00000 A
#3: Alex M DE 42.00000 B
#4: Jodie F OH 18.00000 B
#5: Alex F ID 32.00000 B
Run Code Online (Sandbox Code Playgroud)
df1 <- structure(list(Name = c("Tom", "Nadia", "Alex", "Jodie", "Tom",
"Alex", "Nadia", "Tom"), Gender = c("M", "F", "M", "F", "M",
"F", "F", "M"), Place = c("NY", "AT", "DE", "OH", "NY", "ID",
"AT", "OH"), Age = c(24L, 22L, 42L, 18L, 28L, 32L, 34L, 18L),
V1 = c("A", "A", "B", "B", "B", "B", "A", "A")), .Names = c("Name",
"Gender", "Place", "Age", "V1"), class = "data.frame",
row.names = c(NA, -8L))
Run Code Online (Sandbox Code Playgroud)