Mis*_*lav 6 r data-manipulation dplyr data.table
Let's say I have a data frame:
df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))
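where YearFrom is, for example, the year a company was founded and YearTo is the year it was dissolved. If YearTo is NA, the company is still operating.
I would like to count the number of companies in each year.
The table should look like this: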
City | Year | Count
NY   | 2001 | 1
NY   | 2002 | 2
NY   | 2003 | 3
NY   | 2004 | 3
NY   | 2005 | 2
NY   | 2006 | 3
NY   | 2007 | 3
NY   | 2008 | 4
NY   | 2009 | 3
LA   | 2001 | 0
LA   | 2002 | 1
LA   | 2003 | 1
LA   | 2004 | 2
LA   | 2005 | 4
LA   | 2006 | 4
LA   | 2007 | 4
LA   | 2008 | 2
LA   | 2009 | 2
I would like to solve this with the dplyr or data.table package, but I can't figure out how.
First, clean up the data...
library(data.table) # load first: year() below comes from data.table
curr_year = as.integer(year(Sys.Date()))
setDT(df)
df[, YearTo := as.integer(as.character(YearTo)) ]
df[, YearFrom := as.integer(as.character(YearFrom)) ]
df[, quasiYearTo := YearTo ]
df[is.na(YearTo), quasiYearTo := curr_year ]
Then, a non-equi join:
df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE),
on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE,
.N
, by=.EACHI][, .(City, Year = YearFrom, N)]
City Year N
1: LA 2001 0
2: LA 2002 1
3: LA 2003 1
4: LA 2004 2
5: LA 2005 4
6: LA 2006 4
7: LA 2007 4
8: LA 2008 3
9: LA 2009 3
10: NY 2001 1
11: NY 2002 2
12: NY 2003 3
13: NY 2004 3
14: NY 2005 2
15: NY 2006 3
16: NY 2007 3
17: NY 2008 4
18: NY 2009 3
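
To make the join conditions easier to follow, here is a minimal sketch of the same non-equi join on toy data. The table `toy`, the city "X", and the year range are hypothetical and are not part of the question's data; the column names mirror the ones used above. Each row of the grid is matched against every company with YearFrom <= Year and quasiYearTo > Year, and `.N` with `by = .EACHI` counts the matches per grid row.

```r
library(data.table)

# Toy data (hypothetical): two companies in a single city.
# quasiYearTo is treated as exclusive, as in the join above.
toy <- data.table(City        = c("X", "X"),
                  YearFrom    = c(2001L, 2002L),
                  quasiYearTo = c(2003L, 2005L))

# One row per city-year combination that should appear in the result
grid <- CJ(City = "X", Year = 2000:2004)

# For each grid row, count the companies active in that year
res <- toy[grid,
           on = .(City, YearFrom <= Year, quasiYearTo > Year),
           .N,
           by = .EACHI][, .(City, Year = YearFrom, N)]
res
```

After a non-equi join the join columns keep the names from the left table but hold the values from the grid, which is why the last step renames YearFrom to Year.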
A shorter tidyverse solution.
library(tidyverse)
# First, some data prep
df <- mutate(df,
YearFrom = as.numeric(as.character(YearFrom)), #Fix year coding
YearTo = as.numeric(as.character(YearTo)),
YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE))) #Replace NA with max
df %>%
mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>% #Find all years
unnest(Years) %>% #Spread over rows (recent tidyr versions expect the column to be named)
count(Years, City) %>% #Count them
complete(City, Years, fill = list(n = 0)) #Add in zeros, if needed
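
As a small illustration of the "find all years" step, the sketch below shows what `map2()` with the `:` operator produces for two hypothetical companies (the vectors `from` and `to` are made up for this example). It assumes the end year is exclusive, as in the pipeline above.

```r
library(purrr)

# Hypothetical start and (exclusive) end years for two companies
from <- c(2001, 2003)
to   <- c(2003, 2005)

# One vector of active years per company: from[i]:(to[i] - 1)
years <- map2(from, to - 1, `:`)
years
```

One thing to watch for: `:` counts downwards when its right-hand side is smaller than its left-hand side, so a row whose YearTo equals its YearFrom would produce two years rather than none.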
Here is an answer using data.table. The data preparation is at the bottom.
# get list of businesses, one obs per year of operation
cityList <- lapply(seq_len(nrow(df)),
function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])
# combine to a single data.table
dfNew <- rbindlist(cityList)
# get counts
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]
Written as a single statement, this is
# get the counts
rbindlist(lapply(seq_len(nrow(df)),
function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
by=.(City, Year)]
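Here, lapply runs through each row and constructs a data.table with the repeated city value as the first column and the years of operation as the second column. YearTo is decremented so that the closing year is not included. Note that in the data preparation, missing values are set to 2018 so that 2017, the current year at the time of writing, is included.
lapply returns a list of data.tables, which rbindlist combines into a single data.table. This data.table is then aggregated by city-year pair, and the counts are computed with .N.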
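
The row-by-row expansion described above can be sketched on toy data as follows. The table `toy` and the helper column `rid` are hypothetical. The sketch first shows what a single lapply iteration builds, then does the same expansion for all rows by grouping on the row number, which is one possible alternative to lapply plus rbindlist and not the method used in this answer.

```r
library(data.table)

# Toy data (hypothetical): YearTo is already filled in and is exclusive
toy <- data.table(City     = c("X", "X", "Y"),
                  YearFrom = c(2001L, 2002L, 2002L),
                  YearTo   = c(2003L, 2004L, 2003L))

# What one lapply iteration builds: the city repeated for each year of operation
first_row <- toy[1, .(City, Year = seq(YearFrom, YearTo - 1))]

# The same expansion for every row, grouped by row number instead of lapply
expanded <- toy[, .(City, Year = seq(YearFrom, YearTo - 1)),
                by = .(rid = seq_len(nrow(toy)))]

# Count companies per city-year pair
counts <- expanded[, .(Count = .N), by = .(City, Year)]
counts
```

Unlike the join-based approach, this only produces rows for city-year pairs with at least one company, so years with a count of zero do not appear.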
Both return
City Year Count
1: NY 2001 1
2: NY 2002 2
3: NY 2003 3
4: NY 2004 3
5: NY 2005 2
6: NY 2006 3
7: NY 2007 3
...
26: LA 2012 3
27: LA 2013 3
28: LA 2014 3
29: LA 2015 3
30: LA 2016 3
31: LA 2017 3
32: LA 2002 1
33: LA 2003 1
Data
library(data.table)
setDT(df)
# convert string years to integers
df[, grep("Year", names(df), value=TRUE) :=
lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
# replace NA values with 2018 (to include 2017 in count)
df[is.na(YearTo), YearTo := 2018L]