我想从这个数据集创建虚拟变量:
DF<-structure(list(A = c(1, 2, 3, 4, 5), B = c("1,3,2", "2,1,3,6",
"3,2,5,1,7", "3,7,4,2,6,5", "4,10,7,3,5,6")), .Names = c("A", "B"),
row.names = c(NA, 5L), class = "data.frame")
> DF
A B
1 1 1,3,2
2 2 2,1,3,6
3 3 3,2,5,1,7
4 4 3,7,4,2,6,5
5 5 4,10,7,3,5,6
Run Code Online (Sandbox Code Playgroud)
期望的输出应该如下所示:
A 1 2 3 4 5 6 7 8 9 10
1 1 1 1 0 0 0 0 0 0 0
2 1 1 1 0 0 1 0 0 0 0
3 1 1 1 0 1 0 1 0 0 0
4 0 1 1 1 1 1 1 0 0 0
5 0 0 1 1 1 1 1 0 0 1
Run Code Online (Sandbox Code Playgroud)
有没有一种有效的方法来做这样的事情?我可以用strsplit或ifelse.原始数据集非常大,有许多行(> 10k),B列中的值(> 15k).dummy包中的功能dummies不能正常工作.
我还发现了simmilar案例:将一列拆分成多列.但是我的情况下上面链接的工作真的很慢(我的戴尔i7-2630QM,8Gb,Win7 64位,R 2.15.3 64位)最多15分钟.
提前谢谢你的导师.
A5C*_*2T1 18
此处提到的功能现已转移到CRAN上可用的名为"splitstackshape"的软件包中.CRAN上的版本比原始版本快得多.速度应与for本答案末尾的直接循环解决方案相似.有关详细基准,请参阅@ Ricardo的答案.
安装它,并用于concat.split.expanded获得所需的结果:
library(splitstackshape)
concat.split.expanded(DF, "B", fill = 0, drop = TRUE)
# A B_01 B_02 B_03 B_04 B_05 B_06 B_07 B_08 B_09 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Run Code Online (Sandbox Code Playgroud)
原帖
不久前,我写过一个函数,不只是这种分裂,而是其他.这个名为的函数concat.split()可以在这里找到.
对于您的示例数据,用法将是:
## Keeping the original column
concat.split(DF, "B", structure="expanded")
# A B B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1,3,2 1 1 1 NA NA NA NA NA NA NA
# 2 2 2,1,3,6 1 1 1 NA NA 1 NA NA NA NA
# 3 3 3,2,5,1,7 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 3,7,4,2,6,5 NA 1 1 1 1 1 1 NA NA NA
# 5 5 4,10,7,3,5,6 NA NA 1 1 1 1 1 NA NA 1
## Dropping the original column
concat.split(DF, "B", structure="expanded", drop.col=TRUE)
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 NA NA NA NA NA NA NA
# 2 2 1 1 1 NA NA 1 NA NA NA NA
# 3 3 1 1 1 NA 1 NA 1 NA NA NA
# 4 4 NA 1 1 1 1 1 1 NA NA NA
# 5 5 NA NA 1 1 1 1 1 NA NA 1
Run Code Online (Sandbox Code Playgroud)
将NA重新编码为0必须手动完成 - 也许我会更新函数以添加一个选项来执行此操作,同时,实现其中一个更快的解决方案:)
temp <- concat.split(DF, "B", structure="expanded", drop.col=TRUE)
temp[is.na(temp)] <- 0
temp
# A B_1 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9 B_10
# 1 1 1 1 1 0 0 0 0 0 0 0
# 2 2 1 1 1 0 0 1 0 0 0 0
# 3 3 1 1 1 0 1 0 1 0 0 0
# 4 4 0 1 1 1 1 1 1 0 0 0
# 5 5 0 0 1 1 1 1 1 0 0 1
Run Code Online (Sandbox Code Playgroud)
concat.split函数中的大部分开销可能包括从a转换matrix为a data.frame,重命名列等等.用于进行拆分的实际代码是GASP for循环,但测试它,你会发现它表现得非常好:
b = strsplit(DF$B, ",")
ncol = max(as.numeric(unlist(b)))
temp = lapply(b, as.numeric)
## Set up an empty matrix
m = matrix(0, nrow = nrow(DF), ncol = ncol)
## Fill it in
for (i in 1:nrow(DF)) {
m[i, temp[[i]]] = 1
}
## View your result
m
Run Code Online (Sandbox Code Playgroud)
在
Update2下添加了基准:为@ Anada的解决方案添加了bechmarks.哇它快!
为更大的数据集增加了基准,@ Anada的解决方案以更大的利润率提前加速."
原来的答案:正如你可以看到下面,KnownMax和UnknownMax被超越甚至data.table解决方案.虽然,我怀疑如果有10e6 +行,那么data.table解决方案将是最快的.(可以通过简单地修改本文最底部的参数来对其进行基准测试)
KnownMax如果你知道B中的最大值,那么你有一个很好的双线:
maximum <- 10
results <- t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 1 1 1 0 0 0 0 0 0 0
# [2,] 1 1 1 0 0 1 0 0 0 0
# [3,] 1 1 1 0 1 0 1 0 0 0
# [4,] 0 1 1 1 1 1 1 0 0 0
# [5,] 0 0 1 1 1 1 1 0 0 1
Run Code Online (Sandbox Code Playgroud)
三行,如果要命名列和行:
dimnames(results) <- list(seq(nrow(results)), seq(ncol(results)))
Run Code Online (Sandbox Code Playgroud)
UnknownMax# if you do not know the maximum ahead of time:
splat <- strsplit(DF$B, ",")
maximum <- max(as.numeric(unlist(splat)))
t(sapply(splat, `%in%`, x=1:maximum)) + 0
Run Code Online (Sandbox Code Playgroud)
DT根据@ dickoa的要求,这里有一个选项data.table."
DT <- data.table(DF)
DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]
cols <- DT.long[, max(vals)]
rows <- DT.long[, max(A)]
matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols,
byrow=TRUE, dimnames=list(seq(rows), seq(cols)))
# 1 2 3 4 5 6 7 8 9 10
# 1 1 1 1 0 0 0 0 0 0 0
# 2 1 1 1 0 0 1 0 0 0 0
# 3 1 1 1 0 1 0 1 0 0 0
# 4 0 1 1 1 1 1 1 0 0 0
# 5 0 0 1 1 1 1 1 0 0 1
Run Code Online (Sandbox Code Playgroud)
类似的设置可以在基地完成R,以及
===
以下是一些数据略大的基准测试:
microbenchmark(KnownMax = eval(KnownMax), UnknownMax = eval(UnknownMax),
DT.withAssign = eval(DT.withAssign),
DT.withOutAssign = eval(DT.withOutAssign),
lapply.Dickoa = eval(lapply.Dickoa), apply.SimonO101 = eval(apply.SimonO101),
forLoop.Ananda = eval(forLoop.Ananda), times=50L)
Run Code Online (Sandbox Code Playgroud)
使用OP data.frame,结果为5 x 10
Unit: microseconds
expr min lq median uq max neval
KnownMax 106.556 114.692 122.4915 129.406 6427.521 50
UnknownMax 114.470 122.561 128.9780 136.384 158.346 50
DT.withAssign 3000.777 3099.729 3198.8175 3291.284 10415.315 50
DT.withOutAssign 2637.023 2739.930 2814.0585 2903.904 9376.747 50
lapply.Dickoa 7031.791 7315.781 7438.6835 7634.647 14314.687 50
apply.SimonO101 430.350 465.074 487.9505 522.938 7568.442 50
forLoop.Ananda 81.415 91.027 99.7530 104.588 265.394 50
Run Code Online (Sandbox Code Playgroud)
使用稍大data.frame(下同),其中的结果为1000×100
移除lapply.Dickoa作为我的编辑可能已经放慢下来,因为它站在坠毁.
Unit: milliseconds
expr min lq median uq max neval
KnownMax 34.83210 35.59068 36.13330 38.15960 52.27746 50
UnknownMax 36.41766 37.17553 38.03075 47.71438 55.57009 50
DT.withAssign 31.95005 32.65798 33.73578 43.71493 50.05831 50
DT.withOutAssign 31.36063 32.08138 32.80728 35.32660 51.00037 50
apply.SimonO101 78.61677 91.72505 95.53592 103.36052 163.14346 50
forLoop.Ananda 13.61827 14.02197 14.18899 14.58777 26.42266 50
Run Code Online (Sandbox Code Playgroud)
甚至更大的设置,其结果是10,000 x 600
Unit: milliseconds
expr min lq median uq max neval
KnownMax 1583.5902 1631.6214 1658.6168 1724.9557 1902.3923 50
UnknownMax 1597.1215 1655.9634 1690.7550 1735.5913 1804.2156 50
DT.withAssign 586.4675 641.7206 660.7330 716.0100 1193.4806 50
DT.withOutAssign 587.0492 628.3731 666.3148 717.5575 776.2671 50
apply.SimonO101 1916.6589 1995.2851 2044.9553 2079.6754 2385.1028 50
forLoop.Ananda 163.4549 172.5627 182.6207 211.9153 315.0706 50
Run Code Online (Sandbox Code Playgroud)
使用以下内容:
library(microbmenchmark)
library(data.table)
KnownMax <- quote(t(sapply(strsplit(DF$B, ","), `%in%`, x=1:maximum)) + 0)
UnknownMax <- quote({ splat <- strsplit(DF$B, ","); maximum <- max(as.numeric(unlist(splat))); t(sapply(splat, `%in%`, x=1:maximum)) + 0})
DT.withAssign <- quote({DT <- data.table(DF); DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
DT.withOutAssign <- quote({DT.long <- DT[, list(vals=as.numeric(unlist(strsplit(B, ",")))), by=A]; cols <- DT.long[, max(vals)]; rows <- DT.long[, max(A)] ; matrix(as.numeric(DT.long[, (1:cols) %in% vals, by=A]$V1), ncol=cols, byrow=TRUE, dimnames=list(seq(rows), seq(cols)))})
lapply.Dickoa <- quote({ tmp <- strsplit(DF$B, ","); label <- 1:max(as.numeric(unlist(tmp))); tmp <- lapply(tmp, function(x) as.data.frame(lapply(label, function(y) (x == y)))); unname(t(sapply(tmp, colSums))) })
apply.SimonO101 <- quote({cols <- 1:max( as.numeric( unlist(strsplit(DF$B,",")))); t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) ) })
forLoop.Ananda <- quote({b = strsplit(DF$B, ","); ncol = max(as.numeric(unlist(b))); temp = lapply(b, as.numeric); m = matrix(0, nrow = nrow(DF), ncol = ncol) ; for (i in 1:nrow(DF)) { m[i, temp[[i]]] = 1 }; m })
# slightly modified @Dickoa's alogrithm to allow for instances were B is only a single number.
# Instead of using `sapply(.)`, I used `as.data.frame(lapply(.))` which hopefully the simplification process in sapply is analogous in time to `as.data.frame`
identical(eval(lapply.Dickoa), eval(UnknownMax))
identical(eval(lapply.Dickoa), unname(eval(apply.SimonO101)))
identical(eval(lapply.Dickoa), eval(KnownMax))
identical(unname(as.matrix(eval(DT.withAssign))), eval(KnownMax))
# ALL TRUE
Run Code Online (Sandbox Code Playgroud)
这是用于创建示例数据的内容:
# larger data created as follows
set.seed(1)
maximum <- 600
rows <- 10000
DF <- data.frame(A=seq(rows), B=as.character(c(maximum, replicate(rows-1, paste(sample(maximum, sample(20), FALSE), collapse=",")))), stringsAsFactors=FALSE)
DT <- data.table(DF);
DT
Run Code Online (Sandbox Code Playgroud)
你可以用一种方法ifelse和strsplit(除非我误解,你不希望使用他们?)是这样的....
cols <- 1:max( as.numeric( unlist(strsplit(DF$B,","))))
df <- t(apply(DF["B"] , 1 , function(x) ifelse( cols %in% as.numeric( unlist( strsplit(x , ",") ) ) , 1 , 0 ) ) )
colnames(df) <- cols
df
# 1 2 3 4 5 6 7 8 9 10
#1 1 1 1 0 0 0 0 0 0 0
#2 1 1 1 0 0 1 0 0 0 0
#3 1 1 1 0 1 0 1 0 0 0
#4 0 1 1 1 1 1 1 0 0 0
#5 0 0 1 1 1 1 1 0 0 1
Run Code Online (Sandbox Code Playgroud)
我们的想法是,我们在您想要的列中获取唯一值的向量,找到max值并创建一个向量1:max(value)然后应用于每一行,以找出该行的哪些值在所有值的向量中.ifelse如果它在那里我们使用1和0如果不存在则使用0.将vector在我们的比赛是一个序列,因此其输出为准备进行排序.