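I want to combine two tables based on first name, last name, and year, and create a new binary variable indicating whether each row of table 1 is present in the second table.
The first table is a panel data set of some attributes of NBA players during a season: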
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic","Larry","Larry")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson","Bird","Bird")
year<-c("1991","1992","1993","1991","1992","1993","1992","1992")
season<-data.frame(firstname,lastname,year)
  firstname lastname year
1   Michael   Jordan 1991
2   Michael   Jordan 1992
3   Michael   Jordan 1993
4     Magic  Johnson 1991
5     Magic  Johnson 1992
6     Magic  Johnson 1993
7     Larry     Bird 1992
8     Larry     Bird 1992
The second data.frame is a panel data set of some attributes of NBA players who were selected to play in the All-Star game:
firstname<-c("Michael","Michael","Michael","Magic","Magic","Magic")
lastname<-c("Jordan","Jordan","Jordan","Johnson","Johnson","Johnson")
year<-c("1991","1992","1993","1991","1992","1993")
ALLSTARS<-data.frame(firstname,lastname,year)
  firstname lastname year
1   Michael   Jordan 1991
2   Michael   Jordan 1992
3   Michael   Jordan 1993
4     Magic  Johnson 1991
5     Magic  Johnson 1992
6     Magic  Johnson 1993
The result I want looks like this:
  firstname lastname year allstars
1   Michael   Jordan 1991        1
2   Michael   Jordan 1992        1
3   Michael   Jordan 1993        1
4     Magic  Johnson 1991        1
5     Magic  Johnson 1992        1
6     Magic  Johnson 1993        1
7     Larry     Bird 1992        0
8     Larry     Bird 1992        0
I tried using a left join, but I'm not sure whether that makes sense:
test<-join(season, ALLSTARS, by =c("lastname","firstname","year") , type = "left", match = "all")
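It looks like you're using join() from the plyr package. You're almost there: just precede your command with ALLSTARS$allstars <- 1. Then do the join as written, and finally convert the NA values to 0. So: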
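library(plyr)
# Flag every row of ALLSTARS so that matched rows carry a 1 after the join
ALLSTARS$allstars <- 1
test <- join(season, ALLSTARS, by = c("lastname", "firstname", "year"), type = "left", match = "all")
# Rows of season with no match in ALLSTARS come back as NA; recode them as 0
test$allstars[is.na(test$allstars)] <- 0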
Result:
  firstname lastname year allstars
1   Michael   Jordan 1991        1
2   Michael   Jordan 1992        1
3   Michael   Jordan 1993        1
4     Magic  Johnson 1991        1
5     Magic  Johnson 1992        1
6     Magic  Johnson 1993        1
7     Larry     Bird 1992        0
8     Larry     Bird 1992        0
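That said, I would personally use left_join or right_join from the dplyr package, as in David's answer, rather than plyr's join(). Also note that you don't actually need the by argument to join() in this case, because by default the function tries to join on all fields with common names, which is exactly what you want here.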
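For reference, here is a minimal sketch of how the dplyr route might look. It is not David's answer (which is not reproduced here), just one plausible way to write it. The names allstar_flags and result are made up for illustration, and the distinct() call is an added precaution, assuming you never want a season row duplicated if ALLSTARS happens to list the same player-year twice.

```r
library(dplyr)

# Same data as in the question; stringsAsFactors = FALSE keeps the keys as
# plain character columns on older versions of R as well
season <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic", "Larry", "Larry"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson", "Bird", "Bird"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993", "1992", "1992"),
  stringsAsFactors = FALSE
)
ALLSTARS <- data.frame(
  firstname = c("Michael", "Michael", "Michael", "Magic", "Magic", "Magic"),
  lastname  = c("Jordan", "Jordan", "Jordan", "Johnson", "Johnson", "Johnson"),
  year      = c("1991", "1992", "1993", "1991", "1992", "1993"),
  stringsAsFactors = FALSE
)

# One row per all-star player-year, flagged with a 1 (hypothetical helper table)
allstar_flags <- ALLSTARS %>%
  distinct(firstname, lastname, year) %>%
  mutate(allstars = 1)

# Keep every row of season; unmatched rows get NA, which coalesce() turns into 0
result <- season %>%
  left_join(allstar_flags, by = c("firstname", "lastname", "year")) %>%
  mutate(allstars = coalesce(allstars, 0))

print(result)
```

As with join(), the by argument could be dropped here, since left_join() also defaults to joining on all columns the two tables have in common; spelling it out just keeps the join keys explicit and avoids the informational message dplyr prints when it picks them itself.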