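I created the following plots in R using stat_density_2d() (left) and geom_density2d_filled() (right). Although the two plots look the same visually, their levels differ markedly.
How can these values be reconciled or interpreted? For example, does the yellow region in the right-hand plot cover the 25% of observations in the densest area, and the cyan region 50%? How do these different levels relate to each other?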
library(ggplot2)
set.seed(123)
dat <-
  data.frame(
    X = c(rnorm(300, 3, 2.5), rnorm(150, 7, 2)),
    Y = c(rnorm(300, 6, 2.5), rnorm(150, 2, 2)))
ggplot(dat, aes(X, Y)) +
  stat_density_2d(geom = "polygon",
                  aes(fill = after_stat(level)), bins = 4) +
  geom_point(alpha = 0.1)
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "ndensity",
    breaks = seq(0.25, 1, length.out = 4)
  ) +
  geom_point(alpha = 0.1)
# EDIT to incorporate chart based on comment
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    bins = 4) +
  geom_point(alpha = 0.1)
All*_*ron 14
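Although this has been discussed before, I thought I would post an answer here showing how to ensure that a specific proportion of the points falls within each contour line.
To do this, we can get the 2D density using MASS::kde2d, then convert it to a raster using terra. We can then order the points by the density of the grid cell each one falls in, and use approx to find the density at which each quantile is passed.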
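density_quantiles <- function(x, y, quantiles) {
  # 2D kernel density estimate on a 500 x 500 grid, converted to a raster
  dens <- MASS::kde2d(x, y, n = 500)
  df   <- cbind(expand.grid(x = dens$x, y = dens$y), z = c(dens$z))
  r    <- terra::rast(df)
  # Raster cell containing each observation (terra is not attached, so use ::)
  ind  <- terra::cellFromXY(r, cbind(x, y))
  # Densities at the observations, sorted from highest to lowest
  ind  <- ind[order(-r[ind][[1]])]
  vals <- r[ind][[1]]
  # Density at which each cumulative proportion of points is reached
  ret  <- approx(seq_along(ind)/length(ind), vals, xout = quantiles)$y
  # Proportions outside the interpolation range (such as 0) map to the peak density
  replace(ret, is.na(ret), max(r[]))
}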
This means that if we specify a vector of quantiles:
quantiles <- c(0, 0.2, 0.4, 0.6, 0.8)
we get an easily interpreted plot showing the regions that enclose 20%, 40%, 60% and 80% of our points, like this:
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quantiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quantiles',
                       labels = scales::percent(quantiles[-1]),
                       direction = -1)
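As a rough sanity check on the idea of "density thresholds that enclose a fixed share of the points", the sketch below recomputes such thresholds using only MASS and base R. It is not the density_quantiles function from this answer: point_density is a hypothetical helper that looks up the density at the grid node nearest to each observation, and the 200 x 200 grid size is an arbitrary assumption.

```r
library(MASS)  # provides kde2d

set.seed(123)
dat <- data.frame(
  X = c(rnorm(300, 3, 2.5), rnorm(150, 7, 2)),
  Y = c(rnorm(300, 6, 2.5), rnorm(150, 2, 2)))

# Hypothetical helper: density at the grid node nearest to each observation
point_density <- function(x, y, n = 200) {
  dens <- MASS::kde2d(x, y, n = n)
  ix <- round((x - min(dens$x)) / diff(dens$x[1:2])) + 1
  iy <- round((y - min(dens$y)) / diff(dens$y[1:2])) + 1
  dens$z[cbind(ix, iy)]
}

d <- point_density(dat$X, dat$Y)
probs <- c(0.2, 0.4, 0.6, 0.8)

# The densest 20% of points sit above the 80th percentile of d, and so on
thresholds <- quantile(d, probs = 1 - probs, names = FALSE)

# Share of observations at or above each threshold
coverage <- sapply(thresholds, function(t) mean(d >= t))
print(round(coverage, 2))
```

The printed shares should land close to 0.2, 0.4, 0.6 and 0.8, which is the property the filled bands are meant to have.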
Quartiles would look like this:
quartiles <- c(0, 0.25, 0.5, 0.75)
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quartiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quartiles', labels = scales::percent(quartiles[-1]),
                       direction = -1)
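Note that this is very different from the levels in the ndensity plot, which are proportions of the maximum density rather than regions containing a fixed proportion of the points. In other words, if you look at a 3D representation of the ndensity plot from the side, the bands will all have the same height, as the following animation shows (see the footnote for the code that generated it):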
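To make the contrast concrete, here is a small sketch that measures what share of the observations lies above each ndensity break from the question. It uses MASS::kde2d directly with a nearest-grid-node lookup, so it only approximates what ggplot2 draws; the grid size of 200 and the variable names are assumptions made for illustration.

```r
library(MASS)  # provides kde2d

set.seed(123)
dat <- data.frame(
  X = c(rnorm(300, 3, 2.5), rnorm(150, 7, 2)),
  Y = c(rnorm(300, 6, 2.5), rnorm(150, 2, 2)))

dens <- MASS::kde2d(dat$X, dat$Y, n = 200)

# Nearest grid node for each observation
ix <- round((dat$X - min(dens$x)) / diff(dens$x[1:2])) + 1
iy <- round((dat$Y - min(dens$y)) / diff(dens$y[1:2])) + 1

# Density at each observation as a fraction of the peak density
nd <- dens$z[cbind(ix, iy)] / max(dens$z)

# The ndensity breaks used in the question: 0.25, 0.5, 0.75, 1
breaks <- seq(0.25, 1, length.out = 4)

# Share of observations whose relative density is at or above each break
share_above <- sapply(breaks, function(b) mean(nd >= b))
print(data.frame(ndensity_break = breaks,
                 share_of_points = round(share_above, 3)))
```

The breaks are evenly spaced in height, but nothing ties the resulting shares of points to 25%, 50% or 75%; they depend entirely on the shape of the density surface.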
When there is a region of high density, the code can appear to give "odd" results, like this:
set.seed(123)
dat <-
  data.frame(
    X = c(rnorm(300, 3, 2.5), rnorm(150, 7, 2), rnorm(450, 4, 0.5)),
    Y = c(rnorm(300, 6, 2.5), rnorm(150, 2, 2), rnorm(450, 5, 0.5)))
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quantiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quantiles', labels = scales::percent(quantiles[-1]),
                       direction = -1)
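We can see that the 80th percentile contains "islands", or discontiguous regions. However, this is only because we are still plotting density, and these are the regions above the threshold density value that encloses the correct number of points. Wherever the threshold is set, there is no guarantee that a density band will be a single contiguous region.
We can see this clearly in a 3D version of the plot coloured by the density_quantiles breaks for quantiles, where small "lumps" of low density poke through our threshold here and there.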
dens <- MASS::kde2d(dat$X, dat$Y, n = 1000)
levels <- as.character(cut(dens$z[-1, -1],
                           breaks = density_quantiles(dat$X, dat$Y, quantiles),
                           labels = c(scales::viridis_pal()(4))))
persp(dens, col = levels, phi = 20, theta = -20, axes = FALSE, border = NA)
If instead you want the regions to be contiguous, the problem becomes ill-defined. For example, it is not clear how the following density plot should look if the regions had to be contiguous:
set.seed(123)
dat <-
  data.frame(
    X = c(rnorm(150, 3, 2), rnorm(150, 10, 2)),
    Y = c(rnorm(150, 3, 2), rnorm(150, 10, 2)))
ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "density",
    breaks = density_quantiles(dat$X, dat$Y, quartiles)) +
  geom_point(alpha = 0.1) +
  coord_equal() +
  scale_fill_viridis_d('Quartiles', labels = scales::percent(quartiles[-1]),
                       direction = -1)
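In this example, to get contiguous regions, one of the two clusters would have to be chosen as the centre from which points are counted. That would give very artificial and misleading results that do not reflect density at all. The regions would instead be a function of distance from the highest-density point, and so would always be a set of nested circles, like a bullseye.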
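For illustration only, this is roughly what such a distance-based scheme would compute for the two-cluster data. It is a sketch of the alternative being argued against, not a recommendation; the grid size and the names peak, dist_to_peak and radii are assumptions.

```r
library(MASS)  # provides kde2d

set.seed(123)
dat <- data.frame(
  X = c(rnorm(150, 3, 2), rnorm(150, 10, 2)),
  Y = c(rnorm(150, 3, 2), rnorm(150, 10, 2)))

dens <- MASS::kde2d(dat$X, dat$Y, n = 200)

# Grid node with the highest estimated density (first one if there are ties)
peak <- which(dens$z == max(dens$z), arr.ind = TRUE)[1, ]
peak_x <- dens$x[peak[["row"]]]
peak_y <- dens$y[peak[["col"]]]

# Euclidean distance of every observation from that peak
dist_to_peak <- sqrt((dat$X - peak_x)^2 + (dat$Y - peak_y)^2)

# Radii of the circles around the peak enclosing 25%, 50% and 75% of points
probs <- c(0.25, 0.5, 0.75)
radii <- quantile(dist_to_peak, probs = probs, names = FALSE)
print(round(radii, 2))
```

Each radius defines a contiguous disc holding the requested share of points, but the discs are centred on a single cluster and ignore the second one, which is why the result says little about density.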
Code for animation
library(magick)

p1 <- ggplot(dat, aes(X, Y)) +
  geom_density2d_filled(
    aes(fill = after_stat(level)),
    contour_var = "ndensity",
    breaks = seq(0.25, 1, length.out = 4),
    show.legend = FALSE
  ) +
  geom_point(alpha = 0.1) +
  coord_fixed(0.95, expand = FALSE) +
  theme(plot.margin = margin(75, 50, 60, 50))

ggsave('gg.png', p1, width = 480, height = 480, units = 'px', dpi = 72)

dens <- MASS::kde2d(dat$X, dat$Y, n = 1000)
df2 <- data.frame(x = dens$x, y = apply(dens$z, 1, max)/max(dens$z))

p2 <- ggplot(df2, aes(x, y)) +
  geom_area(fill = scales::viridis_pal()(3)[3]) +
  geom_area(fill = scales::viridis_pal()(3)[2],
            aes(y = ifelse(y > 0.75, 0.75, y))) +
  geom_area(fill = scales::viridis_pal()(3)[1],
            aes(y = ifelse(y > 0.5, 0.5, y))) +
  geom_area(fill = 'gray92',
            aes(y = ifelse(y > 0.25, 0.25, y))) +
  coord_fixed(diff(range(dens$x))) +
  geom_hline(yintercept = c(0, 0.25, 0.5, 0.75, 1)) +
  theme_classic() +
  theme(plot.margin = margin(58, 50, 50, 45))

ggsave('gg2.png', p2, width = 480, height = 480, units = 'px', dpi = 72)

levels <- as.character(cut(dens$z[-1, -1],
                           breaks = c(0, 0.25, 0.5, 0.75, 1) * max(dens$z),
                           labels = c('gray92', scales::viridis_pal()(3))))

for(i in seq(0, 90, 3)) {
  ragg::agg_png(paste0("persp", sprintf("%02d", i), ".png"))
  persp(dens, col = levels, phi = i, d = 1000, axes = FALSE,
        box = FALSE, border = NA)
  dev.off()
}

f <- list.files(pattern = 'persp\\d+\\.png', full.names = TRUE)

c(c(rep(f[31], 10), rev(f), rep(f[1], 10),
    rep('gg2.png', 10), rep(f[1], 10), f, rep(f[31], 5),
    rep('gg.png', 10))
) %>%
  rev(.) %>%
  image_read() %>%
  image_join() %>%
  image_animate(fps = 10) %>%
  image_write("D:\\persp.gif")