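Hello Stack Overflow users,
I have a question about the results of the aggregate function in R. My aim is to select certain bird species from a data set and calculate the density of observed individuals over the surveyed area. To do this, I took a subset of the main data file and aggregated it over area, calculating the mean and the number of individuals (given by the length of the vector). I then wanted to use the calculated mean area and number of individuals to compute the density. That did not work. The code I used is below: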
> head(data)
  positionmonth positionyear quadrant           Species Code sum_areainkm2
1             5         2014        1 Bar-tailed Godwit 5340      155.6562
2             5         2014        1 Bar-tailed Godwit 5340      155.6562
3             5         2014        1 Bar-tailed Godwit 5340      155.6562
4             5         2014        1 Bar-tailed Godwit 5340      155.6562
5             5         2014        1            Gannet  710      155.6562
6             5         2014        1 Bar-tailed Godwit 5340      155.6562
# Keep only the gannet records (the column is called Species, with a capital S)
sub.gannet <- subset(data, Species == "Gannet")
sub.gannet <- data.frame(sub.gannet)
x <- sub.gannet
# Per year/month/quadrant/species/code: mean surveyed area and number of records
aggr.gannet <- aggregate(sub.gannet$sum_areainkm2, by=list(sub.gannet$positionyear, sub.gannet$positionmonth, sub.gannet$quadrant, sub.gannet$Species, sub.gannet$Code), FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
names(aggr.gannet) <- c("positionyear", "positionmonth", "quadrant", "species", "code", "x")
aggr.gannet <- data.frame(aggr.gannet)
> aggr.gannet
  positionyear positionmonth quadrant species code x.observed_area x.NoInd
1         2014             5        4  Gannet  710         79.8257 10.0000
density <- c(aggr.gannet$x.NoInd/aggr.gannet$x.observed_area)
aggr.gannet <- cbind(aggr.gannet, density)
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1, 0
> density
numeric(0)
> aggr.gannet$x.observed_area
NULL
> aggr.gannet$x.NoInd
NULL
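R does not seem to treat the results of the function (observed_area and NoInd) as numeric values in their own right. That was already apparent when I could not give each of them its own name, but had to call them "x".
Can someone tell me why this happens? How can I calculate the density in this situation? Or is there another way to aggregate several functions over the same variable that produces usable output?
Any ideas would be greatly appreciated, thanks!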
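This is a quirk of aggregate when FUN returns several values: the results are not stored as separate columns but packed together into a single matrix column for the aggregated variable.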
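To see the quirk directly, it can help to inspect the result with str(). The sketch below uses a small made-up data frame loosely modelled on the sample rows in the question (the names toy and agg and the values are illustrative assumptions, not the asker's data). It shows that both statistics sit in one matrix column, which is why $x.NoInd comes back as NULL, and that the values can still be reached through the matrix.

```r
# Toy data loosely modelled on the question's sample (values are illustrative).
toy <- data.frame(
  quadrant      = c(1L, 1L, 1L, 2L),
  Species       = c("Gannet", "Gannet", "Gannet", "Gannet"),
  sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 79.8257)
)

agg <- aggregate(
  toy$sum_areainkm2,
  by  = list(quadrant = toy$quadrant, Species = toy$Species),
  FUN = function(x) c(observed_area = mean(x), NoInd = length(x))
)

str(agg)          # 'x' is one column that holds a two-column matrix
ncol(agg)         # 3 columns, not 4
is.matrix(agg$x)  # TRUE
agg$x.NoInd       # NULL: the printout shows "x.NoInd", but no such column exists

# The values are still reachable by indexing the matrix itself.
no_ind        <- agg$x[, "NoInd"]
observed_area <- agg$x[, "observed_area"]
density       <- no_ind / observed_area
```

Indexing the matrix this way works without restructuring anything, but ordinary columns are usually easier to work with afterwards.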
The easiest way out of this is to pass the result through as.list before as.data.frame, which flattens the data structure.
aggr.gannet <- as.data.frame(as.list(aggr.gannet))
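A minimal sketch of what that round trip does, again on made-up toy data (toy, agg and flat are illustrative names). As far as I understand it, as.list() drops the data frame wrapper and rebuilding with as.data.frame() expands the matrix into one ordinary column per statistic.

```r
# Toy data (illustrative values only).
toy <- data.frame(
  quadrant      = c(1L, 1L, 1L, 2L),
  sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 79.8257)
)

agg <- aggregate(
  toy$sum_areainkm2,
  by  = list(quadrant = toy$quadrant),
  FUN = function(x) c(observed_area = mean(x), NoInd = length(x))
)

# Round trip through a plain list to turn the matrix column into real columns.
flat <- as.data.frame(as.list(agg))

names(flat)               # "quadrant" "x.observed_area" "x.NoInd"
is.numeric(flat$x.NoInd)  # TRUE: now an ordinary numeric column
```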
It will still use x as the name. The way I found to fix that is to use the formula interface of aggregate, so your aggregation would look more like
aggr.gannet<-aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=sub.gannet,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))
Walking through it (here I have not taken the subset, so that the aggregation over species is visible):
df <- structure(list(positionmonth = c(5L, 5L, 5L, 5L, 5L, 5L), positionyear = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L), quadrant = c(1L, 1L, 1L, 1L, 1L, 1L), Species = structure(c(1L, 1L, 1L, 1L, 2L, 1L), .Label = c("Bar-tailed Godwit", "Gannet"), class = "factor"), Code = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L), sum_areainkm2 = c(155.6562, 155.6562, 155.6562, 155.6562, 155.6562, 155.6562)), .Names = c("positionmonth", "positionyear", "quadrant", "Species", "Code", "sum_areainkm2"), class = "data.frame", row.names = c(NA, -6L))
df.agg <- as.data.frame(as.list(aggregate(
sum_areainkm2 ~ positionyear + positionmonth +
quadrant + Species + Code,
data=df,
FUN=function(x) c(observed_area=mean(x), NoInd=length(x)))))
This results in what you want:
> df.agg
  positionyear positionmonth quadrant           Species Code
1         2014             5        1            Gannet  710
2         2014             5        1 Bar-tailed Godwit 5340
  sum_areainkm2.observed_area sum_areainkm2.NoInd
1                    155.6562                   1
2                    155.6562                   5
> names(df.agg)
[1] "positionyear" "positionmonth"
[3] "quadrant" "Species"
[5] "Code" "sum_areainkm2.observed_area"
[7] "sum_areainkm2.NoInd"
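Once the statistics are ordinary columns, the density asked about in the question becomes a plain column operation. A sketch, assuming density means individuals per km2 of observed area; the six sample rows are rebuilt with data.frame() instead of structure() for brevity, and the column name density is my own choice.

```r
# The six sample rows from the question, rebuilt with data.frame().
df <- data.frame(
  positionmonth = rep(5L, 6),
  positionyear  = rep(2014L, 6),
  quadrant      = rep(1L, 6),
  Species       = c("Bar-tailed Godwit", "Bar-tailed Godwit", "Bar-tailed Godwit",
                    "Bar-tailed Godwit", "Gannet", "Bar-tailed Godwit"),
  Code          = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L),
  sum_areainkm2 = rep(155.6562, 6)
)

# Formula interface plus the as.list() round trip, as in the answer.
df.agg <- as.data.frame(as.list(aggregate(
  sum_areainkm2 ~ positionyear + positionmonth + quadrant + Species + Code,
  data = df,
  FUN  = function(x) c(observed_area = mean(x), NoInd = length(x))
)))

# Individuals per km2 of observed area, one value per group.
df.agg$density <- df.agg$sum_areainkm2.NoInd / df.agg$sum_areainkm2.observed_area
df.agg[, c("Species", "sum_areainkm2.NoInd", "density")]
```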
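Obligatory note here that dplyr and data.table are powerful libraries that make this kind of aggregation very simple and efficient.
dplyr has some odd syntax (the %>% operator), but ends up being quite readable and allows chaining of more complex operations: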
library(dplyr)
df %>%
  group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
  summarise(observed_area=mean(sum_areainkm2), NoInd = n())
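Because the summary columns here are ordinary columns from the start, the density can be added in the same chain. A sketch under the assumption that dplyr 1.0.0 or later is installed (for the .groups argument); the result name res and the column name density are my own, and the data frame is rebuilt from the six sample rows so the block runs on its own.

```r
library(dplyr)

# The six sample rows from the question, rebuilt with data.frame().
df <- data.frame(
  positionmonth = rep(5L, 6),
  positionyear  = rep(2014L, 6),
  quadrant      = rep(1L, 6),
  Species       = c("Bar-tailed Godwit", "Bar-tailed Godwit", "Bar-tailed Godwit",
                    "Bar-tailed Godwit", "Gannet", "Bar-tailed Godwit"),
  Code          = c(5340L, 5340L, 5340L, 5340L, 710L, 5340L),
  sum_areainkm2 = rep(155.6562, 6)
)

res <- df %>%
  group_by(positionyear, positionmonth, quadrant, Species, Code) %>%
  summarise(observed_area = mean(sum_areainkm2), NoInd = n(), .groups = "drop") %>%
  # Individuals per km2 of observed area for each group.
  mutate(density = NoInd / observed_area)

res
```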
data.table has a more compact syntax and may be faster for large data sets.
library(data.table)
dt <- as.data.table(df)  # dt is the same data as df, converted to a data.table
dt[,
   .(observed_area=mean(sum_areainkm2), NoInd=.N),
   by=.(positionyear, positionmonth, quadrant, Species, Code)]