I'm afraid my title doesn't fully reflect my question, so let me explain it as best I can. Here is part of my data.
pdpass begyr1 gvkey1 endyr1 begyr2 gvkey2 endyr2 begyr3 gvkey3 endyr3 begyr4 gvkey4 endyr4 begyr5 gvkey5 endyr5
-16348 1991 26005 1993 1994 8852 1996 1997 3708 2005 NA NA NA NA NA NA
-16179 2006 174876 2006 NA NA NA NA NA NA NA NA NA NA NA NA
-16161 1993 30964 2005 NA NA NA NA NA NA NA NA NA NA NA NA
-15631 1996 111491 2006 NA NA NA NA NA NA NA NA NA NA NA NA
-14668 1988 15854 2003 NA NA NA NA NA NA NA NA NA NA NA NA
-13980 1986 13218 1995 1996 9921 2005 NA NA NA NA NA NA NA NA NA
Since the data is wide, I'm also including the dput output.
#dput
structure(list(pdpass = c(-16348L, -16179L, -16161L, -15631L, -14668L, -13980L),
begyr1 = c(1991L, 2006L, 1993L, 1996L, 1988L, 1986L), gvkey1 = c(26005L, 174876L, 30964L, 111491L, 15854L, 13218L), endyr1 = c(1993L, 2006L, 2005L, 2006L, 2003L, 1995L),
begyr2 = c(1994L, NA, NA, NA, NA, 1996L), gvkey2 = c(8852L, NA, NA, NA, NA, 9921L), endyr2 = c(1996L, NA, NA, NA, NA, 2005L),
begyr3 = c(1997L, NA, NA, NA, NA, NA), gvkey3 = c(3708L, NA, NA, NA, NA, NA), endyr3 = c(2005L, NA, NA, NA, NA, NA),
begyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr4 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_),
begyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), gvkey5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_), endyr5 = c(NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_)), row.names = c(NA, -6L), class = "data.frame", .Names = c("pdpass", "begyr1", "gvkey1", "endyr1", "begyr2", "gvkey2", "endyr2", "begyr3", "gvkey3", "endyr3", "begyr4", "gvkey4", "endyr4", "begyr5", "gvkey5", "endyr5"))
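I want to reshape it into long format.
The pdpass column is the ID variable. I want to collect all of the begyr, endyr, and gvkey values for each pdpass value. However, as you can see from the raw data, some pdpass values have only one set of gvkey, endyr, and begyr values, while others have more than one (these appear as gvkey2, endyr2, begyr2, gvkey3, and so on).
Basically, I want one row for each pdpass only where the gvkey, endyear, and begyear values share the same numeric suffix.
Put simply, this is my desired output: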
pdpass gvkeyN gvkey begyearN begyear endyearN endyear
1 -16348 gvkey1 26005 begyear1 1991 endyear1 1993
2 -16348 gvkey2 8852 begyear2 1994 endyear2 1996
3 -16348 gvkey3 3708 begyear3 1997 endyear3 2005
4 -16179 gvkey1 174876 begyear1 2006 endyear1 2006
5 -16161 gvkey1 30964 begyear1 1993 endyear1 2005
6 -15631 gvkey1 111491 begyear1 1996 endyear1 2006
7 -14668 gvkey1 15854 begyear1 1988 endyear1 2003
8 -13980 gvkey1 13218 begyear1 1986 endyear1 1995
9 -13980 gvkey2 9921 begyear2 1996 endyear2 2005
10 -13956 gvkey1 8674 begyear1 1968 endyear1 1987
# code
desired <- data.frame(pdpass = c(rep(-16348, 3), -16179, -16161, -15631, -14668, rep(-13980, 2), -13956),
gvkeyN = c("gvkey1", "gvkey2", "gvkey3", rep("gvkey1", 5), "gvkey2", "gvkey1"),
gvkey = c(26005, 8852, 3708, 174876, 30964, 111491, 15854, 13218, 9921, 8674),
begyearN = c("begyear1", "begyear2", "begyear3", rep("begyear1", 5), "begyear2", "begyear1"),
begyear = c(1991, 1994, 1997, 2006, 1993, 1996, 1988, 1986, 1996, 1968),
endyearN = c("endyear1", "endyear2", "endyear3", rep("endyear1", 5), "endyear2", "endyear1"),
endyear = c(1993, 1996, 2005, 2006, 2005, 2006, 2003, 1995, 2005, 1987))
Here is what I have tried so far:
library(tidyr)
library(dplyr)  # arrange() comes from dplyr, not tidyr
# NBER is the wide data frame shown above
# Gather the gvkey columns, then the begyr columns, then the endyr columns
d1 <- gather(NBER, gvkeyN, gvkey, -pdpass, -endyr1, -begyr1, -endyr2, -begyr2, -endyr3, -begyr3, -endyr4, -begyr4, -endyr5, -begyr5, na.rm = TRUE)
d2 <- gather(d1, begyrN, begyear, -gvkeyN, -gvkey, -pdpass, -endyr1, -endyr2, -endyr3, -endyr4, -endyr5, na.rm = TRUE)
d3 <- gather(d2, endyrN, endyear, -gvkeyN, -gvkey, -pdpass, -begyrN, -begyear, na.rm = TRUE)
d4 <- arrange(d3, pdpass)
head(d4, 10)
pdpass gvkeyN gvkey begyrN begyear endyrN endyear
1 -16348 gvkey1 26005 begyr1 1991 endyr1 1993
2 -16348 gvkey2 8852 begyr1 1991 endyr1 1993
3 -16348 gvkey3 3708 begyr1 1991 endyr1 1993
4 -16348 gvkey1 26005 begyr2 1994 endyr1 1993
5 -16348 gvkey2 8852 begyr2 1994 endyr1 1993
6 -16348 gvkey3 3708 begyr2 1994 endyr1 1993
7 -16348 gvkey1 26005 begyr3 1997 endyr1 1993
8 -16348 gvkey2 8852 begyr3 1997 endyr1 1993
9 -16348 gvkey3 3708 begyr3 1997 endyr1 1993
10 -16348 gvkey1 26005 begyr1 1991 endyr2 1996
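As you can see, I get unnecessary rows. All I need are the rows where the gvkey, endyear, and begyear suffixes match exactly (e.g., gvkey1 begyr1 endyr1; gvkey2 begyr2 endyr2; gvkey3 begyr3 endyr3; and so on). Instead, I get every combination of 1, 2, and 3, so rows 2 through 9 above are all unnecessary.
I could perhaps remove all the unnecessary rows manually after gathering, but I suspect there is a better way to use the gather function. I have looked at similar questions but couldn't find a solution to this one. Can anyone help?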
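For reference, the manual clean-up mentioned above could look roughly like the base R sketch below. It is only an illustration: the d3 data frame here is a hand-built stand-in for a few rows of the gathered result, and suffix() is a hypothetical helper, not part of tidyr.

```r
# Toy stand-in for a few rows of the gathered result (an assumption, built by hand)
d3 <- data.frame(
  pdpass  = rep(-16348L, 4),
  gvkeyN  = c("gvkey1", "gvkey2", "gvkey3", "gvkey2"),
  gvkey   = c(26005L, 8852L, 3708L, 8852L),
  begyrN  = c("begyr1", "begyr1", "begyr1", "begyr2"),
  begyear = c(1991L, 1991L, 1991L, 1994L),
  endyrN  = c("endyr1", "endyr1", "endyr1", "endyr2"),
  endyear = c(1993L, 1993L, 1993L, 1996L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep only the digits of a label such as "gvkey2"
suffix <- function(x) gsub("[^0-9]", "", x)

# Keep a row only when all three labels carry the same numeric suffix
keep <- suffix(d3$gvkeyN) == suffix(d3$begyrN) &
  suffix(d3$gvkeyN) == suffix(d3$endyrN)
matched <- d3[keep, ]
```

This removes the cross-combinations after the fact, but it still builds the full set of combinations first, which may be wasteful on larger data.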
We can use melt from data.table, where df1 is the wide data frame:
library(data.table)
melt(setDT(df1), measure = patterns("^gvkey", "^begyr", "^endyr"),
na.rm = TRUE, value.name = c("gvkey", "begyear", "endyear"))
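If staying within tidyr is preferred, a similar reshape can probably be done with pivot_longer(), which requires tidyr 1.0.0 or later. The sketch below is an alternative to the melt call, not the method from the answer. The data frame wide is a reduced stand-in for the sample data (three rows, suffixes 1 to 3), and the column name N is chosen here only for illustration.

```r
library(tidyr)

# Reduced stand-in for the wide data (an assumption: three sample rows, suffixes 1 to 3)
wide <- data.frame(
  pdpass = c(-16348L, -16179L, -13980L),
  begyr1 = c(1991L, 2006L, 1986L),
  gvkey1 = c(26005L, 174876L, 13218L),
  endyr1 = c(1993L, 2006L, 1995L),
  begyr2 = c(1994L, NA, 1996L),
  gvkey2 = c(8852L, NA, 9921L),
  endyr2 = c(1996L, NA, 2005L),
  begyr3 = c(1997L, NA, NA),
  gvkey3 = c(3708L, NA, NA),
  endyr3 = c(2005L, NA, NA)
)

# Split each column name into a stem (begyr/gvkey/endyr) and a numeric suffix.
# ".value" turns each stem into its own output column, so values that share
# a suffix end up on the same row; all-NA rows are dropped.
long <- pivot_longer(
  wide,
  cols = -pdpass,
  names_to = c(".value", "N"),
  names_pattern = "^([a-z]+)([0-9]+)$",
  values_drop_na = TRUE
)
```

On this stand-in data the result has six rows, one per pdpass and suffix that actually holds values, with the suffix kept as a character column N.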