lae*_*a93 4 aggregate r dataframe
I have the following dataset:
  ClaimType ClaimDay ClaimCost      dates month        day
1         1        1     10811 1970-01-01     1 1970-01-01
2         1        1     18078 1970-01-01     1 1970-01-01
3         1        2     44579 1970-01-01     1 1970-01-02
4         1        3     23710 1970-01-01     1 1970-01-03
5         1        4     29580 1970-01-01     1 1970-01-04
6         1        4     36208 1970-01-01     1 1970-01-04
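I want to create a new dataset containing the columns "ClaimDay" and "day". ClaimDay should hold the number of occurrences of each value. For example, since we have two ones, one two, one three, and two fours, I want the new dataset to look like this: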
ClaimDay        day
       2 1970-01-01
       1 1970-01-02
       1 1970-01-03
       2 1970-01-04
As you can see, ClaimDay and day are related.
I tried
mydata <- aggregate(ClaimDay~Day,FUN=sum,data=mydata)$ClaimDay
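But the problem is that the aggregation computes the sum of the values rather than counting them.
Can anyone help me solve this problem?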
San*_*Dey 10
You can try any of the following approaches:
With base R
aggregate(ClaimDay~day,FUN=length,data=mydata)
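To see why swapping sum for length fixes the attempt in the question, here is a minimal reproducible sketch. The data frame is rebuilt by hand from the six sample rows, keeping only the two relevant columns; storing day as character is an assumption made for simplicity, since the real column may be a Date.

```r
# Sample rows from the question, rebuilt by hand (only the relevant columns).
# Assumption: day is stored as character here; the real column may be a Date.
mydata <- data.frame(
  ClaimDay = c(1, 1, 2, 3, 4, 4),
  day = c("1970-01-01", "1970-01-01", "1970-01-02",
          "1970-01-03", "1970-01-04", "1970-01-04"),
  stringsAsFactors = FALSE
)

# FUN = sum adds up the ClaimDay values within each day
sums <- aggregate(ClaimDay ~ day, FUN = sum, data = mydata)

# FUN = length counts the rows within each day, which is what was asked for
counts <- aggregate(ClaimDay ~ day, FUN = length, data = mydata)

print(sums)
print(counts)
```

On this sample the two results only differ from the second day onward, because the first day happens to contain two claims with ClaimDay equal to 1.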
With tapply
as.data.frame(tapply(mydata$ClaimDay, mydata$day, length), responseName='ClaimDay')
With by
by(mydata$ClaimDay, mydata$day, length, simplify = TRUE)
With dplyr
library(dplyr)
mydata %>% count(day)
With data.table
library(data.table)
# list() in j names the count column ClaimDay
data.table(mydata)[, list(ClaimDay = length(ClaimDay)), by = day]
With plyr
library(plyr)
ddply(mydata,~day,summarise,ClaimDay=length(day))
With sqldf
library(sqldf)
sqldf('select count(ClaimDay) as ClaimDay, day from mydata group by day')
#  ClaimDay        day
#1        2 1970-01-01
#2        1 1970-01-02
#3        1 1970-01-03
#4        2 1970-01-04
And the benchmark results:
library('microbenchmark')
microbenchmark(agg=aggregate(ClaimDay~day,FUN=length,data=mydata),
dplyr=mydata %>% dplyr::count(day),
data.table=data.table(mydata)[,list(ClaimDay=length(ClaimDay)),by=day],
plyr=ddply(mydata,~day,summarise,ClaimDay=length(day)),
tapply=as.data.frame(tapply(mydata$ClaimDay, mydata$day, length), responseName='ClaimDay'),
sqldf=sqldf('select count(ClaimDay) as ClaimDay, day from mydata group by day'),
by=by(mydata$ClaimDay, mydata$day, length, simplify = TRUE),
times=500)
Unit: microseconds
      expr      min        lq       mean    median        uq       max neval cld
       agg 1280.399 1408.2675  1655.8207 1458.9445  1845.331  7732.426   500   c
     dplyr 1019.102 1177.3345  1350.3923 1220.0995  1356.736  3835.208   500   b
data.table 1690.092 1883.8190  2208.6055 1957.1630  2234.283  5493.653   500   d
      plyr 2334.995 2482.7495  2847.0871 2554.5960  2944.404  6620.096   500   e
    tapply  226.658  273.0580   342.0902  304.0635   353.244  2748.965   500   a
     sqldf 8395.718 9057.0870 10458.0976 9440.2650 11389.515 61480.071   500   f
        by  353.243  415.0395   492.2115  449.2520   509.765  4331.287   500   a
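These timings were taken on the six-row sample, so they mostly reflect each function's fixed overhead rather than how it scales. Below is a rough sketch for repeating part of the comparison on a larger input using only base R. The synthetic data, the size n, and the helper name time_ms are all made up for illustration and are not part of the original benchmark.

```r
# Hypothetical larger input: n claims spread over 365 days (synthetic data)
set.seed(1)
n <- 1e5
big <- data.frame(ClaimDay = sample(1:365, n, replace = TRUE))
big$day <- as.Date(big$ClaimDay - 1, origin = "1970-01-01")

# time_ms is an illustrative helper: elapsed milliseconds for one evaluation
time_ms <- function(expr) unname(system.time(expr)["elapsed"]) * 1000

t_agg <- time_ms(res_agg <- aggregate(ClaimDay ~ day, FUN = length, data = big))
t_tap <- time_ms(res_tap <- tapply(big$ClaimDay, big$day, length))

# Both approaches should agree on the per-day counts
stopifnot(all(res_agg$ClaimDay == as.vector(res_tap)))

print(c(aggregate = t_agg, tapply = t_tap))
```

A single system.time call is noisy, so for real measurements the same expressions could be passed to microbenchmark as above.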
小智 3
If you don't mind a dplyr solution, this works for your sample data:
library(dplyr)
df %>% select(ClaimDay, day) %>%
group_by(day) %>%
mutate(ClaimDay.count = n()) %>%
slice(1)
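Note that this pipeline keeps the first ClaimDay value of each group and stores the count in a separate ClaimDay.count column. For readers without dplyr, the same two steps can be sketched in base R. The data frame df is rebuilt by hand from the question's sample, and storing day as character is an assumption.

```r
# Sample rows from the question, rebuilt by hand (assumed structure)
df <- data.frame(
  ClaimDay = c(1, 1, 2, 3, 4, 4),
  day = c("1970-01-01", "1970-01-01", "1970-01-02",
          "1970-01-03", "1970-01-04", "1970-01-04"),
  stringsAsFactors = FALSE
)

# Counterpart of group_by(day) %>% mutate(ClaimDay.count = n()):
# ave() repeats the group size on every row of that group
df$ClaimDay.count <- ave(df$ClaimDay, df$day, FUN = length)

# Counterpart of slice(1): keep only the first row of each day
result <- df[!duplicated(df$day), ]
print(result)
```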