Jos*_*e R 16 boolean r dataframe
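I am trying to convert a factor column into multiple boolean columns, as shown in the image below. The data comes from a weather station and was retrieved with the excellent weatherData package. The factor column I want to convert into multiple boolean columns contains 11 levels. Some of them are single "events", and some are combinations of "events".
Here is an image showing what I want to achieve:
Here is R code that generates a data frame with the combined factor, which I want to convert into several boolean columns: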
df <- read.table(text="
date Events
1/8/2013 Rain
1/9/2013 Fog
1/10/2013 ''
1/11/2013 Fog-Rain
1/12/2013 Snow
1/13/2013 Rain-Snow
1/14/2013 Rain-Thunderstorm
1/15/2013 Thunderstorm
1/16/2013 Fog-Rain-Thunderstorm
1/17/2013 Fog-Thunderstorm
1/18/2013 Fog-Rain-Thunderstorm-Snow",
header = TRUE, stringsAsFactors = TRUE)  # keep Events as a factor on R >= 4.0
df$date <- as.character(as.Date(df$date, "%m/%d/%Y"))
Thanks in advance.
akr*_*run 11
You could try:
lst <- strsplit(as.character(df$Events),"-")
lvl <- unique(unlist(lst))
res <- data.frame(date=df$date,
                  do.call(rbind, lapply(lst, function(x) table(factor(x, levels=lvl)))),
                  stringsAsFactors=FALSE)
res
# date Rain Fog Snow Thunderstorm
#1 2013-01-08 1 0 0 0
#2 2013-01-09 0 1 0 0
#3 2013-01-10 0 0 0 0
#4 2013-01-11 1 1 0 0
#5 2013-01-12 0 0 1 0
#6 2013-01-13 1 0 1 0
#7 2013-01-14 1 0 0 1
#8 2013-01-15 0 0 0 1
#9 2013-01-16 1 1 0 1
#10 2013-01-17 0 1 0 1
#11 2013-01-18 1 1 1 1
Or, possibly faster than the above (contributed by @alexis_laz):
setNames(data.frame(df$date, do.call(rbind,lapply(lst, function(x) as.integer(lvl %in% x)) )), c("date", lvl))
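Both approaches above return 0/1 integer columns. If actual logical (TRUE/FALSE) columns are wanted, as the question asks, the `%in%` test can be kept as logical instead of being converted with `as.integer`. The following is only a minimal sketch: the three-row `df` is a hypothetical stand-in for the question's data, and `flags` and `res_lgl` are names introduced here for illustration.

```r
# Hypothetical small stand-in for the question's data frame.
df <- data.frame(
  date   = c("2013-01-08", "2013-01-10", "2013-01-11"),
  Events = c("Rain", "", "Fog-Rain"),
  stringsAsFactors = TRUE
)

# Split each combined event string into its single events.
lst <- strsplit(as.character(df$Events), "-", fixed = TRUE)
lvl <- unique(unlist(lst))

# One logical row per observation: is each single event present?
flags <- do.call(rbind, lapply(lst, function(x) lvl %in% x))
colnames(flags) <- lvl

res_lgl <- data.frame(date = as.character(df$date), flags,
                      stringsAsFactors = FALSE)
str(res_lgl)
```

An existing 0/1 result can also be converted after the fact, for example with `res[-1] <- lapply(res[-1], as.logical)`.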
Or
library(devtools)
library(data.table)
source_gist("11380733")
library(reshape2) #In case it is needed
res1 <- dcast.data.table(cSplit(df, "Events", "-", "long"), date~Events)
res2 <- merge(subset(df, select=1), res1, by="date", all=TRUE)
res2 <- as.data.frame(res2)
res2[,-1] <- (!is.na(res2[,-1]))+0
res2[,c(1,3,2,4,5)]
# date Rain Fog Snow Thunderstorm
#1 2013-01-08 1 0 0 0
#2 2013-01-09 0 1 0 0
#3 2013-01-10 0 0 0 0
#4 2013-01-11 1 1 0 0
#5 2013-01-12 0 0 1 0
#6 2013-01-13 1 0 1 0
#7 2013-01-14 1 0 0 1
#8 2013-01-15 0 0 0 1
#9 2013-01-16 1 1 0 1
#10 2013-01-17 0 1 0 1
#11 2013-01-18 1 1 1 1
Or
library(qdap)
with(df, termco(Events, date, c("Rain", "Fog", "Snow", "Thunderstorm")))[[1]][,-2]
# date Rain Fog Snow Thunderstorm
#1 2013-01-08 1 0 0 0
#2 2013-01-09 0 1 0 0
#3 2013-01-10 0 0 0 0
#4 2013-01-11 1 1 0 0
#5 2013-01-12 0 0 1 0
#6 2013-01-13 1 0 1 0
#7 2013-01-14 1 0 0 1
#8 2013-01-15 0 0 0 1
#9 2013-01-16 1 1 0 1
#10 2013-01-17 0 1 0 1
#11 2013-01-18 1 1 1 1
The simplest thing I can think of is concat.split.expanded from my "splitstackshape" package (devel version 1.3.0, from GitHub).
## Get the right version of the package
library(devtools)
install_github("mrdwab/splitstackshape", ref = "devel")
packageVersion("splitstackshape")
# [1] ‘1.3.0’
## Split up the relevant column
concat.split.expanded(df, "Events", "-", type = "character",
fill = 0, drop = TRUE)
# date Events_Fog Events_Rain Events_Snow Events_Thunderstorm
# 1 2013-01-08 0 1 0 0
# 2 2013-01-09 1 0 0 0
# 3 2013-01-10 0 0 0 0
# 4 2013-01-11 1 1 0 0
# 5 2013-01-12 0 0 1 0
# 6 2013-01-13 0 1 1 0
# 7 2013-01-14 0 1 0 1
# 8 2013-01-15 0 0 0 1
# 9 2013-01-16 1 1 0 1
# 10 2013-01-17 1 0 0 1
# 11 2013-01-18 1 1 1 1
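While answering this question, I realized that I had somewhat foolishly hard-coded a "trim" feature into concat.split.expanded, which might slow things down a lot. If you want a much faster approach, use charMat (the function called by concat.split.expanded) directly on the split-up version of your "Events" column, like this: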
splitstackshape:::charMat(
  strsplit(as.character(df[, "Events"]), "-", fixed = TRUE), fill = 0)
For some benchmarks, check out this Gist.
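For a rough local comparison, the two base R approaches from this thread can be timed with `system.time`. This is only a sketch on synthetic data: the `events` vector, the sample size, and the helper names `via_table` and `via_in` are assumptions made for illustration, and the timings will vary by machine.

```r
# Synthetic input (an assumption, not the question's data).
set.seed(1)
events <- c("", "Rain", "Fog", "Snow", "Fog-Rain", "Rain-Snow",
            "Fog-Rain-Thunderstorm")
x <- sample(events, 5000, replace = TRUE)

lst <- strsplit(x, "-", fixed = TRUE)
lvl <- c("Rain", "Fog", "Snow", "Thunderstorm")

# Approach 1: tabulate each row against the fixed set of levels.
via_table <- function() {
  do.call(rbind, lapply(lst, function(e) table(factor(e, levels = lvl))))
}

# Approach 2: test membership of each level directly.
via_in <- function() {
  m <- do.call(rbind, lapply(lst, function(e) as.integer(lvl %in% e)))
  colnames(m) <- lvl
  m
}

t_table <- system.time(m1 <- via_table())
t_in    <- system.time(m2 <- via_in())

# Both approaches should produce the same indicator matrix.
stopifnot(all(m1 == m2))

# Elapsed seconds for each approach.
print(c(table = t_table[["elapsed"]], in_op = t_in[["elapsed"]]))
```

For more careful measurements, a dedicated benchmarking package that repeats each expression many times would be preferable to a single `system.time` call.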