jal*_*pic 7 r gsub reshape2 grepl
我正在寻找一些关于数据重组的建议.我正在使用Google Forms收集一些数据,我将其作为csv文件下载,如下所示:
# alpha beta option
# 6 8, 9, 10, 11 apple
# 9 6 pear
# 1 6 apple
# 3 8, 9 pear
# 3 6, 8 lime
# 3 1 apple
# 2, 4, 7, 11 9 lime
Run Code Online (Sandbox Code Playgroud)
数据有两个变量(alpha和beta),每个变量都列出数字.对于我的大多数数据,每个变量中只有一个数字.但是,对于某些观察,可以有两个,三个甚至十个数字.这是因为这些是使用谷歌表单中的"复选框"选项收集的回复,它允许对一个调查问题的多个答案.此外,对于一些潜在的解决方案来说,google表单在每个多个答案之前返回前导空格可能很重要.
在我的实际数据中,这只发生在所有观察的很小一部分中,上面是一个更简洁的例子.数据集中还有其他几个变量.在这里,我只包括一个包含因子的"选项".
我需要做的是在'alpha'或'beta'变量中复制包含多个数字的所有观察.重复行的数量应该等于alpha或beta变量中存在的数字的数量.然后,我需要将'alpha'或'beta'变量中的数字序列单独替换为每个数字.这将导致类似以下内容:
# alpha beta option
# 6 8 apple
# 6 9 apple
# 6 10 apple
# 6 11 apple
# 9 6 pear
# 1 6 apple
# 3 8 pear
# 3 9 pear
# 3 6 lime
# 3 8 lime
# 3 1 apple
# 2 9 lime
# 4 9 lime
# 7 9 lime
# 11 9 lime
Run Code Online (Sandbox Code Playgroud)
以下是再现上述原始示例数据的数据.我已经调用了数据帧'demo':
demo<-structure(list(alpha = structure(c(4L, 5L, 1L, 3L, 3L, 3L, 2L), .Label =
c("1","2, 4, 7, 11", "3", "6", "9"), class = "factor"), beta = structure(c(5L, 2L, 2L,
4L, 3L, 1L, 6L), .Label = c("1", "6", "6, 8", "8, 9", "8, 9, 10, 11", "9"), class =
"factor"), option = structure(c(1L, 3L, 1L, 3L, 2L, 1L, 2L), .Label = c("apple",
"lime", "pear"), class = "factor")), .Names = c("alpha", "beta", "option"), class =
"data.frame", row.names = c(NA, -7L))
Run Code Online (Sandbox Code Playgroud)
好.所以我想我已经编写了一些代码,这些代码以非常冗长的方式确实导致了我正在寻找的新数据帧.然而,感觉必须有更优雅和更好的方式.
基本上,我首先处理'alpha'变量.我首先根据变量中是否存在逗号来对观察进行子集化.通过包含逗号的观察,我然后使用strsplit来分隔数字.然后我计算每个观察数量存在多少个数量,并通过它重复每个观察.然后,我将拆分数字融合到一个数据框中,所有数字都在一个名为"value"的变量中.然后我简单地将'alpha'变量替换为融化的'value'变量中的数据.然后我用不包含逗号的数据重新绑定它.然后我使用这个df并处理'beta'变量....
这是我的解决方案(似乎有用吗?):
library(reshape2)
demo$a<-grepl(",", demo$alpha)
demo.atrue <- demo[ which(demo$a=='TRUE'), ]
demo.afalse <- demo[ which(demo$a=='FALSE'), ]
demo.atrue$alpha<-as.character(demo.atrue$alpha)
temp<-strsplit(demo.atrue$alpha, ",")
temp.lengths<-lapply(temp, length)
for (i in 1:length(temp)) {
df.expanded <- demo.atrue[rep(row.names(demo.atrue), temp.lengths), 1:3]
}
temp.melt<-melt(temp)
df.expanded$alpha<-temp.melt$value
demo.afalse<-demo.afalse[c(1:3)]
demonew<-rbind(demo.afalse, df.expanded)
demonew$b<-grepl(",", demonew$beta)
demonew.btrue <- demonew[ which(demonew$b=='TRUE'), ]
demonew.bfalse <- demonew[ which(demonew$b=='FALSE'), ]
demonew.btrue$beta<-as.character(demonew.btrue$beta)
temp<-strsplit(demonew.btrue$beta, ",")
temp.lengths<-lapply(temp, length)
for (i in 1:length(temp)) {
df.expanded1 <- demonew.btrue[rep(row.names(demonew.btrue), temp.lengths), 1:3]
}
temp.melt<-melt(temp)
df.expanded1$beta<-temp.melt$value
demonew.bfalse<-demonew.bfalse[c(1:3)]
demonew1<-rbind(df.expanded1, demonew.bfalse)
demonew1 #this seems to work, but doesn't feel very efficient
Run Code Online (Sandbox Code Playgroud)
同样可能效率不高,我不确定这是否适用于所有条件.特别是如果同一观察的'alpha'和'beta'变量都存在多个数字.我用几个例子对它进行了测试,看起来没问题,但我对它没有信心.
谢谢你的任何考虑.
您可以使用我的cSplit函数,嵌套两次,如下所示:
cSplit(cSplit(demo, "alpha", ",", "long"), "beta", ",", "long")
# alpha beta option
# 1: 6 8 apple
# 2: 6 9 apple
# 3: 6 10 apple
# 4: 6 11 apple
# 5: 9 6 pear
# 6: 1 6 apple
# 7: 3 8 pear
# 8: 3 9 pear
# 9: 3 6 lime
# 10: 3 8 lime
# 11: 3 1 apple
# 12: 2 9 lime
# 13: 4 9 lime
# 14: 7 9 lime
# 15: 11 9 lime
Run Code Online (Sandbox Code Playgroud)
一些基准:
更有趣的样本数据.700行而不是7行(仍然是一个非常小的数据集)......
demo <- do.call(rbind, replicate(100, demo, FALSE))
library(data.table)
demo2 <- data.table(demo)
Run Code Online (Sandbox Code Playgroud)
测试功能......
## MrFlick's
fun1 <- function() {
do.call(rbind, with(demo, Map(expand.grid,
alpha = strsplit(alpha,", "),
beta = strsplit(beta, ", "),
option = option
)))
}
## Mine
fun2 <- function() {
cSplit(cSplit(demo2, "alpha", ",", "long"), "beta", ",", "long")
}
## thelatemail's one-liner
fun3 <- function() {
do.call(rbind,do.call(Map, c(expand.grid, lapply(demo, strsplit, ", "))))
}
Run Code Online (Sandbox Code Playgroud)
实际的基准......
library(microbenchmark)
microbenchmark(MF = fun1(), AM = fun2(), TH = fun3(), times = 10)
# Unit: milliseconds
# expr min lq median uq max neval
# MF 785.34875 789.94924 800.11046 800.93643 813.62390 10
# AM 11.54569 11.93483 12.14181 12.31329 12.93208 10
# TH 790.46069 799.68518 803.47294 827.69520 899.11219 10
Run Code Online (Sandbox Code Playgroud)
实际上这不应该太糟糕.首先,为简单起见,我将把所有列转换为字符,以便稍后拆分
demo[] <- lapply(demo, as.character)
Run Code Online (Sandbox Code Playgroud)
现在让我们努力工作吧.基本上我会在","分隔符上拆分"alpha"和"beta"列.然后我会expand.grid用来组合"alpha","beta"和"option"的所有元素.这将负责重复必要的行,并且如果"alpha"和"beta"都具有多个值,则将起作用.最后,我将所有新生成的行重新组合成一个很棒的大数据帧.这是代码
do.call(rbind, with(demo, Map(expand.grid,
alpha = strsplit(alpha,", "),
beta = strsplit(beta, ", "),
option = option
)))
Run Code Online (Sandbox Code Playgroud)
就是这样.它会回来
alpha beta option
1 6 8 apple
2 6 9 apple
3 6 10 apple
4 6 11 apple
5 9 6 pear
6 1 6 apple
7 3 8 pear
8 3 9 pear
9 3 6 lime
10 3 8 lime
11 3 1 apple
12 2 9 lime
13 4 9 lime
14 7 9 lime
15 11 9 lime
Run Code Online (Sandbox Code Playgroud)