Number of companies per year using dplyr or data.table

Mis*_*lav 6 r data-manipulation dplyr data.table

Let's say I have a data frame:

df <- data.frame(City = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
                 YearFrom = c("2001", "2003", "2002", "2006", "2008", "2004", "2005", "2005", "2002"),
                 YearTo = c(NA, "2005", NA, NA, "2009", NA, "2008", NA, NA))

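Here YearFrom is, for example, the year a company was established, and YearTo is the year it was dissolved. If YearTo is NA, the company is still operating.

I want to count the number of companies per year.

The table should look like this: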

City  Year  Count
NY    2001  1
NY    2002  2
NY    2003  3
NY    2004  3
NY    2005  2
NY    2006  3
NY    2007  3
NY    2008  4
NY    2009  3
LA    2001  0
LA    2002  1
LA    2003  1
LA    2004  2
LA    2005  4
LA    2006  4
LA    2007  4
LA    2008  2
LA    2009  2

I would like to solve this with the dplyr or data.table package, but I can't figure out how.

Fra*_*ank 8

First, clean up the data...

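library(data.table)

# year() comes from data.table, so the package is loaded first
curr_year = as.integer(year(Sys.Date()))

setDT(df)
# the years are stored as strings (or factors), so convert them to integers
df[, YearTo := as.integer(as.character(YearTo)) ]
df[, YearFrom := as.integer(as.character(YearFrom)) ]
# use the current year as a stand-in closing year for companies still operating
df[, quasiYearTo := YearTo ]
df[is.na(YearTo), quasiYearTo := curr_year ]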

Then, a non-equi join:

df[CJ(City = City, Year = min(YearFrom):max(YearTo, na.rm=TRUE), unique=TRUE), 
  on=.(City, YearFrom <= Year, quasiYearTo > Year), allow.cartesian = TRUE, 
  .N
, by=.EACHI][, .(City, Year = YearFrom, N)]

    City Year N
 1:   LA 2001 0
 2:   LA 2002 1
 3:   LA 2003 1
 4:   LA 2004 2
 5:   LA 2005 4
 6:   LA 2006 4
 7:   LA 2007 4
 8:   LA 2008 3
 9:   LA 2009 3
10:   NY 2001 1
11:   NY 2002 2
12:   NY 2003 3
13:   NY 2004 3
14:   NY 2005 2
15:   NY 2006 3
16:   NY 2007 3
17:   NY 2008 4
18:   NY 2009 3
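
To see what the join condition computes, here is a minimal base R sketch of the same rule. It is only an illustration: the names `firms`, `grid` and `count_active` are hypothetical, and 2018 is used as an assumed stand-in for `curr_year`. For every City and Year combination it counts the rows with YearFrom <= Year and quasiYearTo > Year.

```r
# Toy copy of the cleaned data; 2018 stands in for curr_year (assumption)
firms <- data.frame(
  City        = c("NY", "NY", "NY", "NY", "NY", "LA", "LA", "LA", "LA"),
  YearFrom    = c(2001L, 2003L, 2002L, 2006L, 2008L, 2004L, 2005L, 2005L, 2002L),
  quasiYearTo = c(2018L, 2005L, 2018L, 2018L, 2009L, 2018L, 2008L, 2018L, 2018L),
  stringsAsFactors = FALSE
)

# Every City x Year combination, similar in spirit to CJ(..., unique = TRUE)
grid <- expand.grid(City = sort(unique(firms$City)), Year = 2001:2009,
                    stringsAsFactors = FALSE)

# Count the rows of `firms` that satisfy the join condition for one grid row
count_active <- function(city, year) {
  sum(firms$City == city & firms$YearFrom <= year & firms$quasiYearTo > year)
}

grid$N <- mapply(count_active, grid$City, grid$Year, USE.NAMES = FALSE)
grid <- grid[order(grid$City, grid$Year), ]
print(grid, row.names = FALSE)
```

Because the comparison on the closing year is strict, a company is not counted in the year it closes.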


Axe*_*man 7

A shorter tidyverse solution.

library(tidyverse)

# First, some data prep
df <- mutate(df,
    YearFrom = as.numeric(as.character(YearFrom)),                     # Fix year coding
    YearTo = as.numeric(as.character(YearTo)),
    YearTo = coalesce(YearTo, max(c(YearFrom, YearTo), na.rm = TRUE))) # Replace NA with max

df %>%
  mutate(Years = map2(YearFrom, YearTo - 1, `:`)) %>%          # Find all years
  unnest(Years) %>%                                            # Spread over rows
  count(Years, City) %>%                                       # Count them
  complete(City, Years, fill = list(n = 0))                    # Add in zeros, if needed
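
The last step, `complete()`, adds rows for City and Year combinations that never occur in the counts. As a rough base R illustration of that idea (the toy vectors `city` and `year` are made up for this example), tabulating factors with explicit levels yields a zero for every missing combination:

```r
# Hypothetical expanded data: one element per company-year
city <- c("NY", "NY", "LA", "NY")
year <- c(2001, 2002, 2002, 2002)

# table() counts every combination of the factor levels, including absent ones
counts <- as.data.frame(
  table(City  = factor(city, levels = c("LA", "NY")),
        Years = factor(year, levels = 2001:2002)),
  responseName = "n",
  stringsAsFactors = FALSE
)
counts$Years <- as.integer(counts$Years)
print(counts)
```

Here LA has no company in 2001, so that combination shows up with n equal to 0 instead of being dropped.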


lmo*_*lmo 5

Here is an answer using data.table. The data preparation is at the bottom.

# get list of businesses, one obs per year of operation
cityList <- lapply(seq_len(nrow(df)),
              function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))])

# combine to a single data.table
dfNew <- rbindlist(cityList)

# get counts
dfNew <- dfNew[, .(Count=.N), by=.(City, Year)]

Written as a single statement, this is

# get the counts
rbindlist(lapply(seq_len(nrow(df)),
          function(i) df[i, .(City, "Year"=seq(YearFrom, YearTo - 1))]))[, .(Count=.N),
  by=.(City, Year)]

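Here, lapply runs through each row and constructs a data.table with the city value repeated in the first column and, in the second column, Year, the years of operation. YearTo is decremented so that the year of closing is not included. Note that in the data preparation, missing values are set to 2018 so that the current year (2017) is included.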

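lapply returns a list of data.tables, which is combined into a single data.table via rbindlist. This data.table is then aggregated to city-year pairs, and the count is constructed using .N.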
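
The same expand-then-count idea can be sketched without any packages. This is only an illustration on a made-up three-row table (`firms`, `expanded` and `counts` are hypothetical names), with YearTo already filled in:

```r
# Toy data: YearTo is the closing year, which is not counted
firms <- data.frame(
  City     = c("NY", "NY", "LA"),
  YearFrom = c(2001L, 2002L, 2002L),
  YearTo   = c(2004L, 2003L, 2004L),
  stringsAsFactors = FALSE
)

# One small data frame per company, with one row per year of operation
expanded <- do.call(rbind, lapply(seq_len(nrow(firms)), function(i) {
  data.frame(City = firms$City[i],
             Year = seq(firms$YearFrom[i], firms$YearTo[i] - 1L),
             stringsAsFactors = FALSE)
}))

# Count rows per City-Year pair, a base R counterpart of .N with by=
expanded$Count <- 1L
counts <- aggregate(Count ~ City + Year, data = expanded, FUN = sum)
print(counts)
```

One caveat with this pattern: `seq(a, b)` counts downwards when `b` is smaller than `a`, so a company that opens and closes in the same year would need separate handling.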

These return

    City Year Count
 1:   NY 2001     1
 2:   NY 2002     2
 3:   NY 2003     3
 4:   NY 2004     3
 5:   NY 2005     2
 6:   NY 2006     3
 7:   NY 2007     3
  ...
26:   LA 2012     3
27:   LA 2013     3
28:   LA 2014     3
29:   LA 2015     3
30:   LA 2016     3
31:   LA 2017     3
32:   LA 2002     1
33:   LA 2003     1

Data

library(data.table)
setDT(df)
# convert string years to integers
df[, grep("Year", names(df), value=TRUE) := 
   lapply(.SD, function(x) as.integer(as.character(x))), .SDcols=grep("Year", names(df))]
# replace NA values with 2018 (to include 2017 in count); 2018L keeps the column integer
df[is.na(YearTo), YearTo := 2018L]