Given two data frames:
df1 = data.frame(CustomerId = c(1:6), Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 = data.frame(CustomerId = c(2, 4, 6), State = c(rep("Alabama", 2), rep("Ohio", 1)))
df1
# CustomerId Product
# 1 Toaster
# 2 Toaster
# 3 Toaster
# 4 Radio
# 5 Radio
# 6 Radio
df2
# CustomerId State
# 2 Alabama
# 4 Alabama
# 6 Ohio
How can I do database-style, i.e. SQL-style, joins? That is, how do I get:
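As a hedged sketch of one way to approach this, base R's merge() can express the common SQL-style joins through its all, all.x, and all.y arguments. The four join types shown and the result names (inner, outer, left, right) are assumptions chosen for illustration, not something stated in the question.

```r
# Same data frames as in the question.
df1 <- data.frame(CustomerId = 1:6,
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

# Inner join: only CustomerIds present in both data frames.
inner <- merge(df1, df2, by = "CustomerId")

# Full outer join: all rows from both sides, NA where there is no match.
outer <- merge(df1, df2, by = "CustomerId", all = TRUE)

# Left outer join: every row of df1, with State filled in where available.
left <- merge(df1, df2, by = "CustomerId", all.x = TRUE)

# Right outer join: every row of df2, with Product filled in where available.
right <- merge(df1, df2, by = "CustomerId", all.y = TRUE)
```

Naming the key explicitly with by = "CustomerId" avoids relying on merge()'s default of joining on all shared column names.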
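I know we can add column names dynamically when creating columns by reference (using :=), as described in: Dynamic column names in data.table.
However, I would like to add column names dynamically when aggregating. Can you help?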
library(data.table)
test_dtb <- data.table(a = sample(1:100, 100), b = sample(1:100, 100), id = rep(1:10, 10))
m = "blah"
test_dtb[ , list((m) = mean(b)), by = id]
The error I get is
Error: unexpected '=' in "test_dtb[ , list((m) =
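The call fails at parse time: inside a function call, the left-hand side of = must be a plain name or a string literal, so (m) = ... is not valid R syntax. One possible workaround, sketched here under assumptions rather than taken from the thread, is to build the result list unnamed and name it afterwards with setNames(). The set.seed() call and the res variable are additions for illustration only.

```r
library(data.table)

set.seed(1)  # assumption: added only so the sketch is reproducible
test_dtb <- data.table(a = sample(1:100, 100),
                       b = sample(1:100, 100),
                       id = rep(1:10, 10))
m <- "blah"

# Compute the aggregate as an unnamed list, then take its name
# from the string stored in m.
res <- test_dtb[, setNames(list(mean(b)), m), by = id]

names(res)
```

Because the name is attached to an ordinary list, the same pattern should extend to several aggregates by passing a longer list and a character vector of names.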
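I know there are many answers on this forum about how to get summary statistics (e.g. mean, SE, N) for multiple groups using options such as aggregate, ddply, or data.table. However, I am not sure how to apply these functions to multiple columns at once.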
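More specifically, I would like to know how to extend the following ddply command to multiple columns (dv1, dv2, dv3) without retyping the code with a different variable name each time.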
library(reshape2)
library(plyr)
group1 <- c(rep(LETTERS[1:4], c(4,6,6,8)))
group2 <- c(rep(LETTERS[5:8], c(6,4,8,6)))
group3 <- c(rep(LETTERS[9:10], c(12,12)))
my.dat <- data.frame(group1, group2, group3, dv1=rnorm(24),dv2=rnorm(24),dv3=rnorm(24))
my.dat
data1 <- ddply(my.dat, c("group1", "group2","group3"), summarise,
N = length(dv1),
mean = mean(dv1,na.rm=TRUE),
sd = sd(dv1,na.rm=TRUE),
se = sd / sqrt(N)
)
data1
How can I apply this ddply call to multiple columns, so that the result is data1, data2, data3, ... for each outcome variable? I think this might be the solution:
dfm <- melt(my.dat, id.vars = c("group1", "group2","group3"))
lapply(list(.(group1, variable), .(group2, variable),.(group3, variable)),
ddply, .data = dfm, .fun = summarize,
mean = …
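For the multi-column summary asked about above, one possible approach is to wrap the ddply call in a small function that takes the column name as a string and then lapply over the outcome variables. This is only a sketch under assumptions: summarise_dv, dvs, and results are hypothetical names introduced here, and set.seed() is added for reproducibility. It does not attempt to complete the truncated melt-based attempt.

```r
library(plyr)

set.seed(42)  # assumption: added only so the sketch is reproducible
group1 <- rep(LETTERS[1:4], c(4, 6, 6, 8))
group2 <- rep(LETTERS[5:8], c(6, 4, 8, 6))
group3 <- rep(LETTERS[9:10], c(12, 12))
my.dat <- data.frame(group1, group2, group3,
                     dv1 = rnorm(24), dv2 = rnorm(24), dv3 = rnorm(24))

dvs <- c("dv1", "dv2", "dv3")

# Hypothetical helper: summarise one outcome column per group combination.
summarise_dv <- function(dv) {
  ddply(my.dat, c("group1", "group2", "group3"), function(d) {
    x <- d[[dv]]                 # pick the column by its name
    s <- sd(x, na.rm = TRUE)
    data.frame(N = length(x),    # same N definition as the original call
               mean = mean(x, na.rm = TRUE),
               sd = s,
               se = s / sqrt(length(x)))
  })
}

# One summary data frame per outcome, stored in a named list.
results <- setNames(lapply(dvs, summarise_dv), dvs)
```

results$dv1 then plays the role of data1, results$dv2 of data2, and so on; keeping them in a list avoids creating separately numbered objects by hand.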