Identify blocks of rows by start and end values

Chr*_*ris 10 r data.table

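I need to identify blocks of rows in a data.table by start-row and end-row criteria. In the MWE below, a block starts at a row where colA == "d" and continues until a row where colA == "a".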

library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
in.data$wanted.column <- c(NA, NA, NA, NA, 1, 1, 1, NA, NA, NA, 2, 2, 2, 2, NA)

in.data
#     colA wanted.column
#  1:    b            NA
#  2:    f            NA
#  3:    b            NA
#  4:    k            NA
#  5:    d             1
#  6:    b             1
#  7:    a             1
#  8:    s            NA
#  9:    a            NA
# 10:    n            NA
# 11:    d             2
# 12:    f             2
# 13:    d             2
# 14:    a             2
# 15:    t            NA

(It does not matter whether rows outside a block get NA, zero, or any other recognisable value.)

Mar*_*pov 5

UPDATE

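The original version of this answer looked for the shortest sequences, which is wrong because a sequence can contain the start symbol in its middle, e.g. c('d','f','d','a'). The edited version fixes this.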

UPDATE2

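It was pointed out to me that when two sequences directly follow one another (e.g. in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t"))), they were numbered as a single sequence, which is wrong. Here I fix this by tracking the occurrences of symbol.stop in colA.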

Setup

library(data.table)
in.data <- data.table(colA=c("b", "f", "b", "k", "d", "b", "a", "s", "a", "n", "d", "f", "d", "a", "t"))
symbol.start='d'
symbol.stop='a'

Actual code

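# y: number of stop symbols at or below each row; out: TRUE from the first start symbol of each y-group on
in.data[,y := rev(cumsum(rev(colA)==symbol.stop))][,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y]

# Replace TRUE with the block number (the factor's integer codes); FALSE is coerced to 0
in.data$out[in.data$out] <- as.factor(max(in.data$y)-in.data$y[in.data$out])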

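Here, [,y := rev(cumsum(rev(colA)==symbol.stop))] creates a column y that groups the data set by the occurrences of symbol.stop, counted from the end. The expression [,out:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N),by=y] returns a logical vector that tells whether a row belongs to a symbol.start...symbol.stop sequence. The next line is needed to number these sequences.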
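The two grouping steps can be traced on the sample data. The sketch below works on a separate table named dt and stores the flag in a column named inside; both names are assumptions introduced here for illustration. It writes the flag as seq_len(.N) >= match(...), which should be equivalent to the negated comparison used in the answer.

```r
library(data.table)

# Same sample data and symbols as in the setup; dt is a separate copy.
dt <- data.table(colA = c("b", "f", "b", "k", "d", "b", "a", "s",
                          "a", "n", "d", "f", "d", "a", "t"))
symbol.start <- "d"
symbol.stop  <- "a"

# Step 1: count the stop symbols at or below each row, so that all rows
# ending at the same stop symbol share one value of y.
dt[, y := rev(cumsum(rev(colA) == symbol.stop))]

# Step 2: within each y-group, flag the rows from the first start symbol
# onwards; a group without a start symbol stays FALSE throughout.
dt[, inside := seq_len(.N) >= match(symbol.start, colA, nomatch = .N + 1L), by = y]

print(dt)
```

For the sample data, y is 3 for rows 1-7, 2 for rows 8-9, 1 for rows 10-14 and 0 for row 15, and inside is TRUE for rows 5-7 and 11-14.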

Cleanup and output

in.data$y <- NULL   

in.data
#     colA out
#  1:    b   0
#  2:    f   0
#  3:    b   0
#  4:    k   0
#  5:    d   1
#  6:    b   1
#  7:    a   1
#  8:    s   0
#  9:    a   0
# 10:    n   0
# 11:    d   2
# 12:    f   2
# 13:    d   2
# 14:    a   2
# 15:    t   0

UPDATE3

In case anyone needs it, a one-liner solution:

in.data[        , y := rev(cumsum(rev(colA)==symbol.stop))
      ][        , z:=(!match(symbol.start,colA,nomatch=.N+1)>1:.N), by=y
      ][ z==TRUE, out:=as.numeric(factor(y,levels=unique(y)))
      ][        , c('z','y'):=list(NULL,NULL)]
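As a quick check of the case raised in UPDATE2, the same chain can be wrapped in a small function and run on the back-to-back input. This is only a sketch: label_blocks is a hypothetical helper name introduced here, not part of data.table, and it returns NA for rows outside any block.

```r
library(data.table)

# Hypothetical helper (name is an assumption): returns the block number
# for each element of x, or NA for elements outside any block.
label_blocks <- function(x, symbol.start, symbol.stop) {
  dt <- data.table(colA = x)
  # Group rows by the number of stop symbols at or below them.
  dt[, y := rev(cumsum(rev(colA) == symbol.stop))]
  # Flag rows from the first start symbol of each group onwards.
  dt[, z := !(match(symbol.start, colA, nomatch = .N + 1L) > seq_len(.N)), by = y]
  # Number the flagged groups in order of appearance.
  dt[z == TRUE, out := as.numeric(factor(y, levels = unique(y)))]
  dt$out
}

# Two sequences that directly follow one another, as in UPDATE2.
adjacent <- c("b", "f", "b", "k", "d", "b", "a", "d", "f", "d", "a", "t")
res <- label_blocks(adjacent, "d", "a")
print(res)
```

On this input, rows 5-7 get 1 and rows 8-11 get 2, so the two adjacent sequences are numbered separately.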