Create a sequence number (counter) for rows within each group of a data frame

sur*_*esh 33 r dataframe

How can we generate unique ID numbers within each group of a data frame? Here is some data grouped by "personid":

personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23

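I would like to add an id column with a unique value for each row within each subset defined by "personid", always starting at 1. This is my desired output: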

personid date measurement id
1         x     23         1
1         x     32         2
2         y     21         1
3         x     23         1
3         z     23         2
3         y     23         3

I would appreciate any help.

Jos*_*ien 28

The misleadingly named function ave(), called with the argument FUN=seq_along, will accomplish this nicely, even if your personid column is not strictly ordered.

df <- read.table(text = "personid date measurement
1         x     23
1         x     32
2         y     21
3         x     23
3         z     23
3         y     23", header=TRUE)

## First with your data.frame
ave(df$personid, df$personid, FUN=seq_along)
# [1] 1 2 1 1 2 3

## Then with another, in which personid is *not* in order
df2 <- df[c(2:6, 1),]
ave(df2$personid, df2$personid, FUN=seq_along)
# [1] 1 1 1 2 3 2
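To store the counter as a column, the same ave() call can be assigned back to the data frame. The sketch below is one possible variant, not part of the original answer: it passes a plain row index as the first argument so that the result is an integer vector even when the grouping column is a character vector or a factor. The name `df3` and its character IDs are made up for illustration.

```r
# Hypothetical data with character IDs that are not in sorted order
df3 <- data.frame(
  personid    = c("b", "a", "b", "c", "a", "b"),
  measurement = c(23, 32, 21, 23, 23, 23)
)

# ave() splits the first argument by the grouping variable, applies FUN to
# each piece, and puts the results back in the original row order
df3$id <- ave(seq_along(df3$personid), df3$personid, FUN = seq_along)

df3$id
# [1] 1 1 2 1 2 3
```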


Hen*_*rik 21

Some dplyr alternatives, using the convenience functions row_number and n.

library(dplyr)
df %>% group_by(personid) %>% mutate(id = row_number())
df %>% group_by(personid) %>% mutate(id = 1:n())
df %>% group_by(personid) %>% mutate(id = seq_len(n()))
df %>% group_by(personid) %>% mutate(id = seq_along(personid))
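Each of these pipelines returns a tibble that is still grouped by personid. A minimal self-contained sketch, assuming the dplyr package is installed, that rebuilds the question's data, adds the counter, and drops the grouping with ungroup() so that later steps operate on the whole table (the name `df_with_id` is an assumption for illustration):

```r
library(dplyr)

# The data from the question
df <- data.frame(
  personid    = c(1, 1, 2, 3, 3, 3),
  date        = c("x", "x", "y", "x", "z", "y"),
  measurement = c(23, 32, 21, 23, 23, 23)
)

df_with_id <- df %>%
  group_by(personid) %>%      # one group per person
  mutate(id = row_number()) %>%  # 1, 2, ... within each group
  ungroup()                   # remove the grouping from the result

df_with_id$id
# [1] 1 2 1 1 2 3
```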

You may also use getanID from the package splitstackshape. Note that the input dataset is returned as a data.table.

library(splitstackshape)
getanID(data = df, id.vars = "personid")
#    personid date measurement .id
# 1:        1    x          23   1
# 2:        1    x          32   2
# 3:        2    y          21   1
# 4:        3    x          23   1
# 5:        3    z          23   2
# 6:        3    y          23   3


mne*_*nel 14

Using data.table, and assuming you wish to order by date within each personid subset:

library(data.table)
DT <- data.table(Data)

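# rank() gives each row's position when sorted by date; order() would
# return the sorting permutation, which is not the same thing in general
DT[, id := rank(date, ties.method = "first"), by = personid]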

##    personid date measurement id
## 1:        1    x          23  1
## 2:        1    x          32  2
## 3:        2    y          21  1
## 4:        3    x          23  1
## 5:        3    z          23  3
## 6:        3    y          23  2
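A note on numbering rows by date: order() returns the permutation that sorts its input, whereas rank() returns the position each element would take after sorting. The two happen to coincide on the sample data, but not in general. A small base R sketch of the difference, using a made-up vector `dates` for a single person:

```r
# Hypothetical dates for one person, deliberately not in sorted order
dates <- c("z", "x", "y")

# Indices that would sort the vector: element 2 first, then 3, then 1
order(dates)
# [1] 2 3 1

# Position of each element in the sorted result: "z" is 3rd, "x" 1st, "y" 2nd
rank(dates, ties.method = "first")
# [1] 3 1 2
```

With ties.method = "first", tied dates are numbered in their order of appearance, so the IDs stay unique within a group.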

If you do not wish to order by date:

DT[, id := 1:.N, by = personid]

##    personid date measurement id
## 1:        1    x          23  1
## 2:        1    x          32  2
## 3:        2    y          21  1
## 4:        3    x          23  1
## 5:        3    z          23  2
## 6:        3    y          23  3
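More recent versions of data.table also provide rowid(), which computes the same within-group counter without a by clause. A minimal sketch, assuming an installed data.table version that includes rowid() and rebuilding the question's data under the name `DT2` (a name chosen here for illustration):

```r
library(data.table)

# The data from the question
DT2 <- data.table(
  personid    = c(1, 1, 2, 3, 3, 3),
  date        = c("x", "x", "y", "x", "z", "y"),
  measurement = c(23, 32, 21, 23, 23, 23)
)

# rowid() numbers the occurrences of each personid value in order of appearance
DT2[, id := rowid(personid)]

DT2$id
# [1] 1 2 1 1 2 3
```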

Any of the following would also work:

DT[, id := seq_along(measurement), by =  personid]
DT[, id := seq_along(date), by =  personid]

The equivalent commands using plyr:

library(plyr)
# ordering by date (rank() gives each row's position; order() gives the permutation)
ddply(Data, .(personid), mutate, id = rank(date, ties.method = "first"))
# in original order
ddply(Data, .(personid), mutate, id = seq_along(date))
ddply(Data, .(personid), mutate, id = seq_along(measurement))


Ari*_*man 7

I think there is a canned command for this, but I can't remember it. So here is one way:

> test <- sample(letters[1:3],10,replace=TRUE)
> cumsum(duplicated(test))
 [1] 0 0 1 1 2 3 4 5 6 7
> cumsum(duplicated(test))+1
 [1] 1 1 2 2 3 4 5 6 7 8

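This works because duplicated returns a logical vector. cumsum operates on numeric vectors, so the logical values are coerced to numeric. Note that the count is cumulative over the whole vector and does not restart for each distinct value.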
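To see what this expression computes on a fixed input, here is a small sketch using a made-up vector (`test` below is hypothetical, not the random sample above), placed next to a per-group counter built with ave() for comparison:

```r
# Hypothetical fixed input instead of a random sample
test <- c("a", "b", "a", "c", "b", "a")

# duplicated() flags every repeat of an earlier value: F F T F T T
# cumsum() keeps a running total of those flags across the whole vector
running <- cumsum(duplicated(test)) + 1
running
# [1] 1 1 2 2 3 4

# A counter that restarts for each distinct value, for comparison
per_group <- ave(seq_along(test), test, FUN = seq_along)
per_group
# [1] 1 1 2 1 2 3
```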

If desired, you can store the result in your data.frame as a new column:

dat$id <- cumsum(duplicated(test))+1


Jos*_*ich 5

Assuming your data are in a data.frame named Data, this will do the trick:

# ensure Data is in the correct order
Data <- Data[order(Data$personid),]
# tabulate() calculates the number of each personid
# sequence() creates a n-length vector for each element in the input,
# and concatenates the result
Data$id <- sequence(tabulate(Data$personid))
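tabulate() expects positive integers (or a factor), so the approach above relies on personid being of that kind. For sorted IDs of another type, such as character strings, one possible variant takes the group sizes from rle() instead. This is a sketch under that assumption, and the vector `ids` is made up for illustration:

```r
# Hypothetical character IDs, already sorted so equal values are adjacent
ids <- c("a", "a", "b", "c", "c", "c")

# rle() reports the length of each run of consecutive equal values: 2 1 3
runs <- rle(ids)

# sequence() builds 1..n for each run length and concatenates the pieces
id <- sequence(runs$lengths)
id
# [1] 1 2 1 1 2 3
```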