X.J*_*Jun 7 r dplyr data.table
Suppose I have a data frame like this:
family relationship meanings edu
1 1 A respondent 12
2 1 B respondent's spouse 18
3 1 C A's father 10
4 1 D A's mother 9
5 1 E1 A's first son 15
6 1 F1 E1's spouse 14
7 1 G11 E1's first son 3
8 1 G12 E1's second son 1
9 1 E2 A's second son 13
10 2 A respondent 21
11 2 B respondent's spouse 16
12 2 C A's father 12
13 2 D A's mother 16
14 2 E1 A's first son 18
15 2 F1 E1's spouse 15
16 2 E2 A's second son 17
17 2 E3 A's third son 16
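family is the family number. relationship is each person's relationship code within the family. meanings explains the code in the second column, relationship.
I want to calculate the maximum years of education of the parental generation within each family. We do not need the spouses' information.
The expected result is as follows: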
family id edu fedu
1 1 A 12 10
2 1 C 10 NA
3 1 E1 15 18
4 1 E2 13 18
5 1 G11 3 15
6 1 G12 1 15
7 2 A 21 16
8 2 C 12 NA
9 2 E1 18 21
10 2 E2 17 21
11 2 E3 16 21
Here is the data:
d = structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1", "G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"), meanings = c("respondent", "respondent's spouse", "A's father","A's mother", "A's first son", "E1's spouse", "E1's first son","E1's second son", "A's second son", "respondent", "respondent's spouse","A's father", "A's mother", "A's first son", "E1's spouse", "A's second son","A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,16, 12, 16, 18, 15, 17, 16)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"))
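Here is what I tried. I think it is necessary to create a generation variable. Going by the example in your question, C and D are generation 1, A and B are generation 2, E and F are generation 3, and G is generation 4. The first mutate() uses case_when() to create the generation variable. Then I defined groups by family and generation, and for each group I found the maximum education duration (max_ed_duration). Since you said you do not need the spouses' information, I removed the rows whose meanings contain "mother" or "spouse". Then I grouped again by family. Within each family, if generation is 1, NA is assigned to fedu; otherwise fedu gets the max_ed_duration of the previous generation. Finally, I arranged the data by family and relationship.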
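library(dplyr)

# Label each person's generation from the relationship code.
mutate(mydf, generation = case_when(relationship %in% c("C", "D") ~ 1,
                                    relationship %in% c("A", "B") ~ 2,
                                    grepl(x = relationship, pattern = "^[EF]") ~ 3,
                                    grepl(x = relationship, pattern = "^G") ~ 4)) %>%
  # Maximum education per family and generation (mothers and spouses still count here).
  group_by(family, generation) %>%
  mutate(max_ed_duration = max(edu)) %>%
  # Drop mothers and spouses only after the maximum has been computed.
  filter(!grepl(x = meanings, pattern = "mother|spouse")) %>%
  group_by(family) %>%
  # Generation 1 has no parents in the data; others take the previous generation's maximum.
  mutate(fedu = if_else(generation == 1,
                        NA_real_,
                        max_ed_duration[match(x = generation - 1, table = generation)])) %>%
  arrange(family, relationship)
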
# family relationship meanings edu generation max_ed_duration fedu
# <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 A respondent 12 2 18 10
# 2 1 C A's father 10 1 10 NA
# 3 1 E1 A's first son 15 3 15 18
# 4 1 E2 A's second son 13 3 15 18
# 5 1 G11 E1's first son 3 4 3 15
# 6 1 G12 E1's second son 1 4 3 15
# 7 2 A respondent 21 2 21 16
# 8 2 C A's father 12 1 16 NA
# 9 2 E1 A's first son 18 3 18 21
#10 2 E2 A's second son 17 3 18 21
#11 2 E3 A's third son 16 3 18 21
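Two details of the pipeline above can be checked in isolation with base R. The following is only a minimal sketch: the vectors are copied by hand from family 1 in the printed output, and the code "XF" is a made-up value used purely to show how the regular expression alternation is anchored.

```r
# Toy vectors for family 1 after the mother/spouse rows are dropped
# (rows A, C, E1, E2, G11, G12; values copied from the output above).
generation      <- c(2, 1, 3, 3, 4, 4)
max_ed_duration <- c(18, 10, 15, 15, 3, 3)

# match() returns, for each row, the position of the first row that
# belongs to the previous generation (NA when there is none).
idx  <- match(generation - 1, generation)
fedu <- max_ed_duration[idx]
print(idx)
print(fedu)

# Regex detail: "^E|F" parses as "(^E)|(F)", so only E is anchored.
# "XF" is a hypothetical code that does not occur in the data.
codes    <- c("E1", "F1", "XF")
loose    <- grepl("^E|F", codes)   # F may appear anywhere in the string
anchored <- grepl("^[EF]", codes)  # E or F must be the first character
print(loose)
print(anchored)
```

Note that the lookup works at the generation level: every member of generation 4 receives the maximum over all of generation 3, which equals the maximum of their own parents only when, as in this data, the best-educated member of that generation is one of those parents.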
Data
mydf <- structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2,
2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1",
"G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"
), meanings = c("respondent", "respondent's spouse", "A's father",
"A's mother", "A's first son", "E1's spouse", "E1's first son",
"E1's second son", "A's second son", "respondent", "respondent's spouse",
"A's father", "A's mother", "A's first son", "E1's spouse", "A's second son",
"A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,
16, 12, 16, 18, 15, 17, 16)), class = "data.frame", row.names = c(NA,
-17L))