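Suppose I have a dataset of units whose status can change from active to inactive. Each time a unit changes status, I want to record the switch from active to inactive. A reproducible example: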
UNIT <- c(100,100, 200, 200, 200, 200, 200, 300, 300, 300,300)
STATUS <- c('ACTIVE','INACTIVE','ACTIVE','ACTIVE','INACTIVE','ACTIVE','INACTIVE','ACTIVE','ACTIVE',
'ACTIVE','INACTIVE')
TERMINATED <- c('1999-07-06' , '2008-12-05' , '2000-08-18' , '2000-08-18' ,'2000-08-18' ,'2008-08-18',
'2008-08-18','2006-09-19','2006-09-19' ,'2006-09-19' ,'1999-03-15')
START <- c('2007-04-23','2008-12-06','2004-06-01','2007-02-01','2008-04-19','2010-11-29','2010-12-30',
'2007-10-29','2008-02-05','2008-06-30','2009-02-07')
STOP <- c('2008-12-05','4712-12-31','2007-01-31','2008-04-18','2010-11-28','2010-12-29','4712-12-31',
'2008-02-04','2008-06-29','2009-02-06','4712-12-31')
DAT <- data.frame(UNIT,STATUS,TERMINATED,START,STOP)
DAT
   UNIT   STATUS TERMINATED      START       STOP
1   100   ACTIVE 1999-07-06 2007-04-23 2008-12-05
2   100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
3   200   ACTIVE 2000-08-18 2004-06-01 2007-01-31
4   200   ACTIVE 2000-08-18 2007-02-01 2008-04-18
5   200 INACTIVE 2000-08-18 2008-04-19 2010-11-28
6   200   ACTIVE 2008-08-18 2010-11-29 2010-12-29
7   200 INACTIVE 2008-08-18 2010-12-30 4712-12-31
8   300   ACTIVE 2006-09-19 2007-10-29 2008-02-04
9   300   ACTIVE 2006-09-19 2008-02-05 2008-06-29
10  300   ACTIVE 2006-09-19 2008-06-30 2009-02-06
11  300 INACTIVE 1999-03-15 2009-02-07 4712-12-31
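When a unit's status changes from ACTIVE to INACTIVE, the unit has been terminated. Unfortunately, the recorded termination date (TERMINATED) is not valid. The valid termination date is the START date of the record that follows the switch from active to inactive (the row where STATUS == INACTIVE) minus 1 day. In other words, it is the STOP date of the preceding ACTIVE record. For example, for unit 100 the TERMINATED date in row 2 is correct. For unit 300, however, the termination date should be "2009-02-06". The solution should be robust enough to recognize that unit 200 has two inactive periods and fill in the dates accordingly.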
I don't know where to start with something like this in R.
The final result should look like this:
   UNIT   STATUS TERMINATED      START       STOP
1   100   ACTIVE 2008-12-05 2007-04-23 2008-12-05
2   100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
3   200   ACTIVE 2008-04-18 2004-06-01 2007-01-31
4   200   ACTIVE 2008-04-18 2007-02-01 2008-04-18
5   200 INACTIVE 2008-04-18 2008-04-19 2010-11-28
6   200   ACTIVE 2010-12-29 2010-11-29 2010-12-29
7   200 INACTIVE 2010-12-29 2010-12-30 4712-12-31
8   300   ACTIVE 2009-02-06 2007-10-29 2008-02-04
9   300   ACTIVE 2009-02-06 2008-02-05 2008-06-29
10  300   ACTIVE 2009-02-06 2008-06-30 2009-02-06
11  300 INACTIVE 2009-02-06 2009-02-07 4712-12-31
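I haven't spent much time on this, but I think you should be able to do what you need along the following lines.
Convert the dates to an actual date format.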
## Use a real date format
DAT[-c(1, 2)] <- lapply(DAT[-c(1, 2)], as.Date)
Create "groups" based on the combination of UNIT and each change in the STATUS column.
## Identify the "groups" of "ACTIVE" and "INACTIVE"
## by a combination of the first two columns
RLE <- rle(do.call(paste, DAT[1:2]))$lengths
RLES <- rep(seq_along(RLE), RLE)
RLES
# [1] 1 2 3 3 4 5 6 7 7 7 8
Here you can see that row 1 belongs to the first "group", row 2 to the second, rows 3 and 4 to the third, and so on.
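To see how the run-length trick produces those group ids, here is a minimal sketch on a toy vector (`x`, `runs` and `ids` are names made up for this illustration). `rle()` reports the length of each run of identical consecutive values, and `rep(seq_along(...), ...)` expands the run numbers back to one id per element.

```r
# Toy input: identical neighbours form "runs"
x <- c("a", "a", "b", "a", "a", "a")

# Length of each run of identical consecutive values
runs <- rle(x)$lengths
runs
# [1] 2 1 3

# Repeat each run's index by its length: one id per element
ids <- rep(seq_along(runs), runs)
ids
# [1] 1 1 2 3 3 3
```

Note that the second run of "a" gets its own id (3), which is why the two ACTIVE periods of unit 200 end up in separate groups. Because `rle()` only looks at neighbouring values, this depends on the rows already being ordered by UNIT and START, as they are in the example data.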
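Replace the current TERMINATED column.
Using the grouping stored in RLES, we can use ave to create a vector, as long as the number of rows, that contains the last STOP date of each group.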
## Use that grouping to create a partially corrected
## "TERMINATED" column
DAT$TERMINATED <- ave(DAT$STOP, RLES, FUN = max)
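As a small illustration of what `ave()` does in that step, the sketch below uses a few made-up dates and group ids (`stop_dates`, `grp` and `latest` are hypothetical names). `ave()` applies `FUN` within each group and returns a vector as long as its input, with each group's result repeated across that group's elements.

```r
# Three STOP-like dates; the first two belong to the same group
stop_dates <- as.Date(c("2007-01-31", "2008-04-18", "2010-11-28"))
grp <- c(1, 1, 2)

# Replace every element with the latest date found in its group
latest <- ave(stop_dates, grp, FUN = max)
latest
# [1] "2008-04-18" "2008-04-18" "2010-11-28"
```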
Fix the TERMINATED values where STATUS == "INACTIVE".
According to your description, the value here should equal the value in the START column minus 1 day.
## Identify the rows where STATUS == "INACTIVE"
IRows <- which(DAT$STATUS == "INACTIVE")
## Since you have a real date format, you can
## simply use "-1" to adjust the TERMINATED date
## using the value from the "START" date
DAT[IRows, "TERMINATED"] <- DAT[IRows, "START"] - 1
Check the result.
DAT
#    UNIT   STATUS TERMINATED      START       STOP
# 1   100   ACTIVE 2008-12-05 2007-04-23 2008-12-05
# 2   100 INACTIVE 2008-12-05 2008-12-06 4712-12-31
# 3   200   ACTIVE 2008-04-18 2004-06-01 2007-01-31
# 4   200   ACTIVE 2008-04-18 2007-02-01 2008-04-18
# 5   200 INACTIVE 2008-04-18 2008-04-19 2010-11-28
# 6   200   ACTIVE 2010-12-29 2010-11-29 2010-12-29
# 7   200 INACTIVE 2010-12-29 2010-12-30 4712-12-31
# 8   300   ACTIVE 2009-02-06 2007-10-29 2008-02-04
# 9   300   ACTIVE 2009-02-06 2008-02-05 2008-06-29
# 10  300   ACTIVE 2009-02-06 2008-06-30 2009-02-06
# 11  300 INACTIVE 2009-02-06 2009-02-07 4712-12-31
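For reuse, the steps above could be collected into a single function. The following is only a sketch under some assumptions: the name `fix_terminated` is made up here, it expects the column names from the example, and it sorts by UNIT and START first (an extra step not in the walkthrough) because the run-length grouping only makes sense on ordered rows. It also assumes, as in the example data, that each INACTIVE record starts the day after the preceding ACTIVE record stops.

```r
fix_terminated <- function(dat) {
  # Work with real dates so that "- 1" means "minus one day"
  date_cols <- c("TERMINATED", "START", "STOP")
  dat[date_cols] <- lapply(dat[date_cols], as.Date)

  # Assumption: grouping requires rows ordered within each unit
  dat <- dat[order(dat$UNIT, dat$START), ]

  # One id per run of identical UNIT/STATUS pairs
  runs <- rle(paste(dat$UNIT, dat$STATUS))$lengths
  grp <- rep(seq_along(runs), runs)

  # ACTIVE runs: the last STOP date of the run
  dat$TERMINATED <- ave(dat$STOP, grp, FUN = max)

  # INACTIVE rows: the day before their own START
  inactive <- dat$STATUS == "INACTIVE"
  dat$TERMINATED[inactive] <- dat$START[inactive] - 1

  dat
}

# Example data from the question
DAT <- data.frame(
  UNIT = c(100, 100, 200, 200, 200, 200, 200, 300, 300, 300, 300),
  STATUS = c("ACTIVE", "INACTIVE", "ACTIVE", "ACTIVE", "INACTIVE", "ACTIVE",
             "INACTIVE", "ACTIVE", "ACTIVE", "ACTIVE", "INACTIVE"),
  TERMINATED = c("1999-07-06", "2008-12-05", "2000-08-18", "2000-08-18",
                 "2000-08-18", "2008-08-18", "2008-08-18", "2006-09-19",
                 "2006-09-19", "2006-09-19", "1999-03-15"),
  START = c("2007-04-23", "2008-12-06", "2004-06-01", "2007-02-01",
            "2008-04-19", "2010-11-29", "2010-12-30", "2007-10-29",
            "2008-02-05", "2008-06-30", "2009-02-07"),
  STOP = c("2008-12-05", "4712-12-31", "2007-01-31", "2008-04-18",
           "2010-11-28", "2010-12-29", "4712-12-31", "2008-02-04",
           "2008-06-29", "2009-02-06", "4712-12-31")
)

res <- fix_terminated(DAT)
res[c("UNIT", "STATUS", "TERMINATED")]
```

On the example data this should reproduce the TERMINATED column shown in the desired result. If an INACTIVE record could start more than one day after the previous STOP, the two rules (last STOP versus START minus 1) would disagree, and you would need to decide which one defines the termination date.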