I have two data tables about people:
df1 <- data.frame(id = c(113, 202, 377, 288, 359),
                  name = c("Alex", "Silvia", "Peter", "Jack", "Jonny"))
This gives me:
   id   name
1 113   Alex
2 202 Silvia
3 377  Peter
4 288   Jack
5 359  Jonny
I have a second table that contains the names of their family members:
df2 <- data.frame(id = c(113, 113, 113, 202, 202, 359, 359, 359, 359),
                  family.members = c("Ross", "Jefferson", "Max", "Jo", "Michael",
                                     "Jimmy", "Rex", "Bill", "Larry"))
This gives me:
> df2
   id family.members
1 113           Ross
2 113      Jefferson
3 113            Max
4 202             Jo
5 202        Michael
6 359          Jimmy
7 359            Rex
8 359           Bill
9 359          Larry
Now I want to extend table 1 with an additional column holding the total number of family members for each person:
   id   name no.family.members
1 113   Alex                 3
2 202 Silvia                 2
3 377  Peter                 0
4 288   Jack                 0
5 359  Jonny                 4
What is the best way to create this third table in R?
Thanks a lot in advance!
Using dplyr
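library(dplyr)
# Count the family members per id, then left-join the counts onto df1.
# ids that do not occur in df2 get NA in no.family.members.
df1 <- df1 %>%
  left_join(
    df2 %>%
      group_by(id) %>%
      summarize(no.family.members = n()),
    by = "id"
  )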
With dplyr >= 0.3.0.2, this can be rewritten as follows (the count column is then named n):
df3 <- df1 %>% left_join(df2 %>% count(id))
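Both joins above leave NA for people who have no rows in df2, whereas the question asks for 0. A minimal sketch of one way to get zeros and the desired column name, assuming a dplyr version recent enough for count() to accept a name argument; coalesce() replaces the NA values after the join:

```r
library(dplyr)

# Data from the question
df1 <- data.frame(id = c(113, 202, 377, 288, 359),
                  name = c("Alex", "Silvia", "Peter", "Jack", "Jonny"))
df2 <- data.frame(id = c(113, 113, 113, 202, 202, 359, 359, 359, 359),
                  family.members = c("Ross", "Jefferson", "Max", "Jo", "Michael",
                                     "Jimmy", "Rex", "Bill", "Larry"))

df3 <- df1 %>%
  # count rows per id and name the count column directly
  left_join(count(df2, id, name = "no.family.members"), by = "id") %>%
  # ids missing from df2 come back as NA; turn those into 0
  mutate(no.family.members = coalesce(no.family.members, 0L))

df3$no.family.members
# 3 2 0 0 4
```

The row order of df1 is preserved, so no sorting is needed beforehand.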
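Using base R
df1 <- df1[order(df1$id), ]  # sort by id so the rows line up with the counts below
# tapply() returns the counts ordered by df2$id
counts <- with(df2, tapply(family.members, id, length))
# fill in the counts for ids that occur in df2; all other rows stay NA
df1$no.family.members[df1$id %in% names(counts)] <- counts
df1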
   id   name no.family.members
1 113   Alex                 3
2 202 Silvia                 2
4 288   Jack                NA
5 359  Jonny                 4
3 377  Peter                NA
(I think NA is more informative than 0.)
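The tapply() approach depends on df1 being sorted by id before the assignment. If the original row order should be kept and zeros are wanted, as in the question, one possible base R sketch is to tabulate df2$id against the ids of df1. This assumes the ids in df1 are unique:

```r
# Data from the question
df1 <- data.frame(id = c(113, 202, 377, 288, 359),
                  name = c("Alex", "Silvia", "Peter", "Jack", "Jonny"))
df2 <- data.frame(id = c(113, 113, 113, 202, 202, 359, 359, 359, 359),
                  family.members = c("Ross", "Jefferson", "Max", "Jo", "Michael",
                                     "Jimmy", "Rex", "Bill", "Larry"))

# Using df1$id as the factor levels makes table() return one count per
# row of df1, in df1's own order, with 0 for ids that never occur in df2.
counts <- table(factor(df2$id, levels = df1$id))
df1$no.family.members <- as.integer(counts)

df1$no.family.members
# 3 2 0 0 4
```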