Chr*_*ris 10 r data.table
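I need to identify blocks of rows in a data.table using start-row and end-row criteria. In the MWE below, a block starts at a row where colA == "d" and continues through the next row where colA == "a".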
library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
in.data$wanted.column <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, NA)
in.data
# colA wanted.column
# 1: b NA
# 2: f NA
# 3: b NA
# 4: k NA
# 5: d 1
# 6: b 1
# 7: a 1
# 8: s NA
# 9: a NA
# 10: n NA
# 11: d 2
# 12: f 2
# 13: d 2
# 14: a 2
# 15: t NA
(It does not matter whether rows outside a group get NA, zero, or any other recognizable value.)
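The original version of this answer looked for the shortest sequences, which is wrong because a sequence can contain the start symbol in its middle, e.g. c('d','f','d','a'). The edited version of the answer fixes this.
It was pointed out to me that when two sequences directly follow each other (e.g. in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t"))), they were numbered as a single sequence, which is wrong. Here I solve this by tracking the occurrences of symbol.stop in colA.
Setup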
library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
symbol.start='d'
symbol.stop='a'
Actual code
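# y: number of stop symbols at or after each row; out: TRUE from the first start symbol in each y group onwards
in.data[,y := rev(cumsum(rev(colA)==symbol.stop))][,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y]
# replace the TRUE entries with a running sequence number (the factor's integer codes); FALSE becomes 0
in.data$out[in.data$out] <- as.factor(max(in.data$y)-in.data$y[in.data$out])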
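Here, [,y := rev(cumsum(rev(colA)==symbol.stop))] creates a column y that groups the rows by the occurrences of symbol.stop, counted from the end of the table. The expression [,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y] returns a logical vector that tells whether each row belongs to a symbol.start ... symbol.stop sequence. The next line is needed to number these sequences.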
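To make the grouping step concrete, the sketch below recomputes the two intermediate columns on the example data and keeps them visible. The table name `steps` and the column name `in.seq` are used only for this illustration and are not part of the answer's code.

```r
library(data.table)

symbol.start <- "d"
symbol.stop  <- "a"
steps <- data.table(colA = c("b", "f", "b", "k", "d", "b", "a", "s",
                             "a", "n", "d", "f", "d", "a", "t"))

# y counts the stop symbols at or after each row,
# so every group of equal y ends on a stop symbol
steps[, y := rev(cumsum(rev(colA) == symbol.stop))]

# within each group, flag the rows from the first start symbol onwards;
# a group without a start symbol gets nomatch = .N + 1, so no row is flagged
steps[, in.seq := !(match(symbol.start, colA, nomatch = .N + 1) > 1:.N), by = y]

steps$y
# 3 3 3 3 3 3 3 2 2 1 1 1 1 1 0
which(steps$in.seq)
# 5 6 7 11 12 13 14
```

Rows 8 and 9 ("s", "a") form a group of their own that contains no start symbol, which is why the second "a" does not open or extend a sequence.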
Clean up and output
in.data$y <- NULL
in.data
# colA out
# 1: b 0
# 2: f 0
# 3: b 0
# 4: k 0
# 5: d 1
# 6: b 1
# 7: a 1
# 8: s 0
# 9: a 0
# 10: n 0
# 11: d 2
# 12: f 2
# 13: d 2
# 14: a 2
# 15: t 0
In case anyone needs it, here is the solution as a single chained expression:
in.data[ , y := rev(cumsum(rev(colA)==symbol.stop))
][ , z:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N), by=y
][ z==TRUE, out:=as.numeric(factor(y,levels=unique(y)))
][ , c('z','y'):=list(NULL,NULL)]
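As a quick check of the back-to-back case mentioned at the top of this answer, the following sketch runs the same chained expression on that input. The table name `adjacent` is chosen only for this illustration. Note that this variant leaves NA rather than 0 outside the sequences, which the question allows.

```r
library(data.table)

symbol.start <- "d"
symbol.stop  <- "a"
# two sequences that directly follow each other: d b a, then d f d a
adjacent <- data.table(colA = c("b", "f", "b", "k", "d", "b", "a",
                                "d", "f", "d", "a", "t"))

adjacent[, y := rev(cumsum(rev(colA) == symbol.stop))
][, z := !(match(symbol.start, colA, nomatch = .N + 1) > 1:.N), by = y
][z == TRUE, out := as.numeric(factor(y, levels = unique(y)))
][, c("z", "y") := NULL]

adjacent$out
# NA NA NA NA 1 1 1 2 2 2 2 NA
```

The two sequences get different numbers because each stop symbol closes its own y group, so the second "d" starts a new group instead of extending the first one.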