Finding first/last rows by group, "recursively"

sta*_*ant 1 r data.table

I am trying to find an efficient way to find the first and last rows by group.

R) ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),city=c("TU","MI","MI","MI","MI","MI","MI"),code=c(85730,33133,33133,33133,33146,33146,45056))
R) ex
   state city  code
1:    az   TU 85730           
2:    fl   MI 33133           
3:    fl   MI 33133           
4:    fl   MI 33133           
5:    fl   MI 33146           
6:    fl   MI 33146           
7:    oh   MI 45056           

For each grouping variable, I would like to flag the first and last row of its group:

R) ex
   state city  code first.state last.state first.city last.city first.code last.code
1:    az   TU 85730           1          1          1         1          1         1
2:    fl   MI 33133           1          0          1         0          1         0
3:    fl   MI 33133           0          0          0         0          0         0
4:    fl   MI 33133           0          0          0         0          0         1
5:    fl   MI 33146           0          0          0         0          1         0
6:    fl   MI 33146           0          1          0         1          0         1
7:    oh   MI 45056           1          1          1         1          1         1

As far as I know, data.table cannot easily help with this, because by="state,city,code" only looks at the 4 distinct triplets.
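To illustrate the point, here is a minimal sketch using the ex table from above. Grouping by all three columns gives one group per distinct (state, city, code) triplet, so it can locate first/last.code but says nothing about where each state or city group starts and ends. The variable name triplets is only illustrative.

```r
library(data.table)

ex <- data.table(
  state = c("az", "fl", "fl", "fl", "fl", "fl", "oh"),
  city  = c("TU", "MI", "MI", "MI", "MI", "MI", "MI"),
  code  = c(85730, 33133, 33133, 33133, 33146, 33146, 45056)
)

# One row per distinct (state, city, code) triplet, holding the row
# numbers of the first and last row of that group.
triplets <- ex[, list(first = .I[1], last = .I[.N]), by = "state,city,code"]
print(triplets)  # 4 groups, one per triplet
```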

The only way I know is to find first/last.code with by="state,city,code", then find first/last.city with by="state,city", and so on.

This is what I mean:

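# Apply firstLast() to each nested grouping: by[1], by[1:2], ..., by[1:n]
applyAll <- function(DT, by){
    f <- function(n, vec){ return(vec[1:n]) }
    by <- lapply(seq_along(by), FUN=f, by)
    out <- Reduce(f=firstLast, init=DT, x=by)
    return(out)
}
# Flag the first and last row of each group for the innermost column of `by`
# (rows that are neither first nor last are left as NA)
firstLast <- function(DT, by){
    addNames <- paste(c("first", "last"), by[length(by)], sep=".")
    DT[DT[,list(IDX=.I[1]), by=by]$IDX, addNames[1]:=1]
    DT[DT[,list(IDX=.I[.N]), by=by]$IDX, addNames[2]:=1]
    return(DT)
}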

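It is called as applyAll(ex, c("state","city","code")), but this makes numerous copies of DT. My question is whether there is some argument, or something that already exists, that would let us get first/last by group directly. (This is fairly vanilla in SAS, kdb, and SQL.)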

SAS:

data DT;
    set ex;
    by state city code;
    if first.code then firstcode=1;
    if last.code then lastcode=1;
    if first.city then firstcity=1;
    if last.city then lastcity=1;
    if first.state then firststate=1;
    if last.state then laststate=1;
run;

Mat*_*wle 5

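If this is the question:

For a set of columns (x,y,z), I'd like to add an integer column marking the position of the first item of each group for by="x", by="x,y" and by="x,y,z" (three new columns). The first row of each new column will always be 1, since it is always the first item of the first group. I'd also like to add 3 more columns marking the last item in the same 3 groupings. I might have many more than 3 groupings, though, so can this be done programmatically?

Then how about: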

ex=data.table(state=c("az","fl","fl","fl","fl","fl","oh"),
              city=c("TU","MI","MI","MI","MI","MI","MI"),
              code=c(85730,33133,33133,33133,33146,33146,45056))
ex
   state city  code
1:    az   TU 85730
2:    fl   MI 33133
3:    fl   MI 33133
4:    fl   MI 33133
5:    fl   MI 33146
6:    fl   MI 33146
7:    oh   MI 45056

cols = c("state","city","code")
for (i in seq_along(cols)) {
  ex[,paste0("f.",cols[i]):=c(1L,rep(0L,.N-1L)),by=eval(head(cols,i))] # first
  ex[,paste0("l.",cols[i]):=c(rep(0L,.N-1L),1L),by=eval(head(cols,i))] # last
}
ex
   state city  code f.state l.state f.city l.city f.code l.code
1:    az   TU 85730       1       1      1      1      1      1
2:    fl   MI 33133       1       0      1      0      1      0
3:    fl   MI 33133       0       0      0      0      0      0
4:    fl   MI 33133       0       0      0      0      0      1
5:    fl   MI 33146       0       0      0      0      1      0
6:    fl   MI 33146       0       1      0      1      0      1
7:    oh   MI 45056       1       1      1      1      1      1

But as @Roland commented, there may be better ways to achieve your ultimate goal.

And, as requested, here is what should be a faster solution, using .I and .N:

cols = c("state","city","code")
for (i in seq_along(cols)) {
  w = ex[,list(f=.I[1],l=.I[.N]),by=eval(head(cols,i))]
  ex[,paste0(c("f.","l."),cols[i]):=0L]  # add the two 0 columns
  ex[w$f,paste0("f.",cols[i]):=1L]       # mark the firsts
  ex[w$l,paste0("l.",cols[i]):=1L]       # mark the lasts
}

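It should be faster because it groups only once per column and, unlike the first solution, does not create many small vectors (there is no call to c() or rep() for each group).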
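For reuse with an arbitrary number of grouping columns, the .I/.N loop can be wrapped in a function. This is only a sketch: markFirstLast is a hypothetical helper name, not part of data.table, and it assumes the table has no columns that clash with the generated f.<col> and l.<col> names. Because := works by reference, the table passed in is itself modified.

```r
library(data.table)

# Hypothetical helper (not part of data.table): adds 0/1 flag columns
# f.<col> and l.<col> for each nested grouping cols[1], cols[1:2], ...
markFirstLast <- function(DT, cols) {
  for (i in seq_along(cols)) {
    fcol <- paste0("f.", cols[i])
    lcol <- paste0("l.", cols[i])
    # one grouping pass: row numbers of the first and last row per group
    w <- DT[, list(f = .I[1L], l = .I[.N]), by = eval(head(cols, i))]
    DT[, c(fcol, lcol) := 0L]   # add the two columns, all zeros
    DT[w$f, (fcol) := 1L]       # mark the firsts
    DT[w$l, (lcol) := 1L]       # mark the lasts
  }
  invisible(DT)
}

ex <- data.table(
  state = c("az", "fl", "fl", "fl", "fl", "fl", "oh"),
  city  = c("TU", "MI", "MI", "MI", "MI", "MI", "MI"),
  code  = c(85730, 33133, 33133, 33133, 33146, 33146, 45056)
)

markFirstLast(ex, c("state", "city", "code"))
print(ex)
```

The printed table should match the f./l. columns shown in the first solution's output.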