How to calculate the maximum years of education of the parent generation in a family

X.J*_*Jun 7 r dplyr data.table

Suppose I have a data frame like this:

   family relationship meanings              edu
 1      1 A            respondent             12
 2      1 B            respondent's spouse    18
 3      1 C            A's father             10
 4      1 D            A's mother              9
 5      1 E1           A's first son          15
 6      1 F1           E1's spouse            14
 7      1 G11          E1's first son          3
 8      1 G12          E1's second son         1
 9      1 E2           A's second son         13
10      2 A            respondent             21
11      2 B            respondent's spouse    16
12      2 C            A's father             12
13      2 D            A's mother             16
14      2 E1           A's first son          18
15      2 F1           E1's spouse            15
16      2 E2           A's second son         17
17      2 E3           A's third son          16

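family is the family number. relationship gives each person's relationship code within the family. meanings describes what the code in the second column, relationship, stands for.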

(Image: relationships within the first family)

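For each person, I want to calculate the maximum years of education of the parent generation within the family. We do not need the spouses' information.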

The expected result is as follows:

   family id      edu fedu 
 1      1 A        12 10   
 2      1 C        10 NA   
 3      1 E1       15 18   
 4      1 E2       13 18   
 5      1 G11       3 15   
 6      1 G12       1 15   
 7      2 A        21 16   
 8      2 C        12 NA   
 9      2 E1       18 21   
10      2 E2       17 21   
11      2 E3       16 21

Here is the data:

 d = structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1", "G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"), meanings = c("respondent", "respondent's spouse", "A's father","A's mother", "A's first son", "E1's spouse", "E1's first son","E1's second son", "A's second son", "respondent", "respondent's spouse","A's father", "A's mother", "A's first son", "E1's spouse", "A's second son","A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21,16, 12, 16, 18, 15, 17, 16)), row.names = c(NA, -17L), class = c("tbl_df", "tbl", "data.frame"))

jaz*_*rro 3

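Here is what I tried. I think it is necessary to create a generation variable. Judging from the example image in your question, C and D are the 1st generation, A and B are the 2nd generation, E and F are the 3rd generation, and G is the 4th generation. The first mutate() uses case_when() to create the generation variable. Then I defined groups by family and generation. For each group, I identified the maximum education duration (i.e., max_ed_duration). Since you said you do not need the spouses' information, I removed the rows whose meanings contain "mother" or "spouse". Then I defined groups again, this time by family only. Within each family, if generation is 1, NA is assigned to fedu; otherwise, the max_ed_duration of the previous generation is assigned to fedu. Finally, I arranged the data by family and relationship.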

library(dplyr)

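# Assign each person a generation from the relationship code:
# C/D = 1, A/B = 2, E*/F* = 3, G* = 4.
mutate(mydf, generation = case_when(relationship %in% c("C", "D") ~ 1,
                                    relationship %in% c("A", "B") ~ 2,
                                    grepl(x = relationship, pattern = "^[EF]") ~ 3,
                                    grepl(x = relationship, pattern = "^G") ~ 4)) %>%
  group_by(family, generation) %>%
  # Maximum education per family and generation (spouses are still included here).
  mutate(max_ed_duration = max(edu)) %>%
  # Drop mothers and spouses from the output rows.
  filter(!grepl(x = meanings, pattern = "mother|spouse")) %>%
  group_by(family) %>%
  # Look up the previous generation's maximum; generation 1 has none.
  mutate(fedu = if_else(generation == 1,
                        NA_real_,
                        max_ed_duration[match(x = generation - 1, table = generation)])) %>%
  arrange(family, relationship)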

#   family relationship meanings          edu generation max_ed_duration  fedu
#    <dbl> <chr>        <chr>           <dbl>      <dbl>           <dbl> <dbl>
# 1      1 A            respondent         12          2              18    10
# 2      1 C            A's father         10          1              10    NA
# 3      1 E1           A's first son      15          3              15    18
# 4      1 E2           A's second son     13          3              15    18
# 5      1 G11          E1's first son      3          4               3    15
# 6      1 G12          E1's second son     1          4               3    15
# 7      2 A            respondent         21          2              21    16
# 8      2 C            A's father         12          1              16    NA
# 9      2 E1           A's first son      18          3              18    21
#10      2 E2           A's second son     17          3              18    21
#11      2 E3           A's third son      16          3              18    21
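The least obvious step is the lookup inside the last mutate(), where match() finds, for each row, the first row belonging to the previous generation. Below is a minimal base R sketch of just that step. The two standalone vectors are an illustration and not part of the pipeline above; their values are copied from the family 1 rows of the output (in the order A, C, E1, E2, G11, G12).

```r
# Values for family 1 after the filter() step, one element per remaining person.
generation      <- c(2, 1, 3, 3, 4, 4)
max_ed_duration <- c(18, 10, 15, 15, 3, 3)

# For each person, the position of the first row in the previous generation.
# Generation 1 looks for generation 0, which does not exist, so it gets NA.
idx <- match(generation - 1, generation)
idx
# 2 NA 1 1 3 3

# Indexing with NA yields NA, so generation 1 ends up with a missing fedu.
fedu <- max_ed_duration[idx]
fedu
# 10 NA 18 18 15 15
```

Because every row in a generation carries the same max_ed_duration, it does not matter which row of the previous generation match() returns. The lookup does assume that the previous generation still has at least one row left after the filter() step.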

Data

mydf <- structure(list(family = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 
2, 2, 2, 2, 2), relationship = c("A", "B", "C", "D", "E1", "F1", 
"G11", "G12", "E2", "A", "B", "C", "D", "E1", "F1", "E2", "E3"
), meanings = c("respondent", "respondent's spouse", "A's father", 
"A's mother", "A's first son", "E1's spouse", "E1's first son", 
"E1's second son", "A's second son", "respondent", "respondent's spouse", 
"A's father", "A's mother", "A's first son", "E1's spouse", "A's second son", 
"A's third son"), edu = c(12, 18, 10, 9, 15, 14, 3, 1, 13, 21, 
16, 12, 16, 18, 15, 17, 16)), class = "data.frame", row.names = c(NA, 
-17L))