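I am interested in de-identifying a sensitive data set that contains both time-fixed and time-varying values. I want to (a) group all cases by social security number, (b) assign those cases a unique ID, and then (c) remove the social security number.
Here is an example data set: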
personal_id gender temperature
111-11-1111 M 99.6
999-999-999 F 98.2
111-11-1111 M 97.8
999-999-999 F 98.3
888-88-8888 F 99.0
111-11-1111 M 98.9
Any solutions would be greatly appreciated.
con*_*nor 32
dplyr has a group_indices function for creating unique group IDs:
library(dplyr)
data <- data.frame(personal_id = c("111-111-111", "999-999-999", "222-222-222", "111-111-111"),
                   gender = c("M", "F", "M", "M"),
                   temperature = c(99.6, 98.2, 97.8, 95.5))
data$group_id <- data %>% group_indices(personal_id)
data <- data %>% select(-personal_id)
data
gender temperature group_id
1 M 99.6 1
2 F 98.2 3
3 M 97.8 2
4 M 95.5 1
Or within the same pipeline (https://github.com/tidyverse/dplyr/issues/2160):
data %>%
  mutate(group_id = group_indices(., personal_id))
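Note that the IDs in the output above follow the sorted order of the personal_id values, not the order in which they first appear, which is why the second row gets ID 3. A minimal base-R sketch of the same numbering, assuming the sort order matches the one dplyr uses for these strings; `ids` and `group_id` are illustrative names only:

```r
# Illustrative vector of identifiers (same values as the example above)
ids <- c("111-111-111", "999-999-999", "222-222-222", "111-111-111")

# Number each distinct value by its position among the sorted unique values
group_id <- match(ids, sort(unique(ids)))

group_id
```

If numbering by order of first appearance is preferred, `match(ids, unique(ids))` is one way to get it.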
tmf*_*mnk 28
As of dplyr 1.0.0, dplyr::cur_group_id() should be used instead of dplyr::group_indices():
df %>%
  group_by(personal_id) %>%
  mutate(group_id = cur_group_id())
personal_id gender temperature group_id
<chr> <chr> <dbl> <int>
1 111-11-1111 M 99.6 1
2 999-999-999 F 98.2 3
3 111-11-1111 M 97.8 1
4 999-999-999 F 98.3 3
5 888-88-8888 F 99 2
6 111-11-1111 M 98.9 1
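The result above still carries personal_id and is still grouped. A sketch of the full (a) to (c) workflow from the question, assuming dplyr 1.0.0 or later; the names `df` and `deidentified` are illustrative, and the data is the example from the question:

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  personal_id = c("111-11-1111", "999-999-999", "111-11-1111",
                  "999-999-999", "888-88-8888", "111-11-1111"),
  gender = c("M", "F", "M", "F", "F", "M"),
  temperature = c(99.6, 98.2, 97.8, 98.3, 99.0, 98.9)
)

deidentified <- df %>%
  group_by(personal_id) %>%              # (a) group rows by identifier
  mutate(group_id = cur_group_id()) %>%  # (b) one integer ID per group
  ungroup() %>%                          # drop the grouping before removing its key
  select(-personal_id)                   # (c) remove the identifier

deidentified
```

The ungroup() step matters here: on a grouped data frame, select() keeps the grouping column, so personal_id would otherwise remain in the result.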