使用 hexbins 显示分类变量的比例(如 hextri)

Sam*_*ore 8 r ggplot2

通过该网站的建议,我在 ggplot 中构建了一个 hexbin 图,它显示每个 bin 中的数据点计数,并突出显示感兴趣的特定 bin

我现在想进一步扩展此图,以显示每个六边形中第二个分组类别的比例。这已经可以通过hextri包实现,但我无法将上一个问题的 ggplot 解决方案与 hextri 包的输出结合起来。

最终目标是得到一个看起来像 hextri 包的输出的图,并突出显示感兴趣的单元格。

下面是一些示例数据代码,可以创建带有突出显示的单元格的 ggplot 以及显示分类比例的 hextri 图。我想将这两个功能结合起来。

我尝试使用 hextri 函数的边界输入来实现所需的结果,但尚未成功。

library(hextri)
library(ggplot2)

n = 100

df = data.frame(x = rnorm(n), 
                y = rnorm(n),
                group = sample(0:1, n, prob = c(0.9, 0.1), replace = TRUE))

# hextri plot
hextri_plot = hextri(
  df$x,
  df$y,
  class = df$group,
  colour = c("#00b38a", "#ea324c"),
  nbins = 3,
  diffuse = FALSE, 
  sorted = TRUE
) 


# GGplot
ggplot(df, aes(x = x, y = y)) +
  geom_hex() +
  stat_summary_hex(aes(
    z = group,
    color = after_stat(as.character(value))
  ), fun = ~ +any(.x == 1), fill = NA) +
  scale_color_manual(
    values = c("0" = "transparent", "1" = "yellow"),
    guide = "none"
  )


Run Code Online (Sandbox Code Playgroud)

All*_*ron 17

这不是一个小问题。它需要编写一个新的Geom、一个新的Stat和一个新的Grob(见下文)。我个人并不相信它是一个很好的数据可视化选项,因为它会导致位置扭曲并且涉及显着的舍入误差。然而,它在视觉上很有吸引力并且相当直观,所以我还是继续写了一个geom_hextri。为了让它发挥作用,我们只需将其美学映射到一个分类变量,它的行为就应该与预期的一样。

让我们使用您自己的示例数据:

set.seed(1)
n = 100

df = data.frame(x = rnorm(n), 
                y = rnorm(n),
                group = sample(0:1, n, prob = c(0.9, 0.1), replace = TRUE))
Run Code Online (Sandbox Code Playgroud)

geom_hextri使用您选择的配色方案绘制它。我们将覆盖点,以便确保段填充的逻辑与点匹配。

ggplot(df, aes(x, y, fill = factor(group), color = factor(group))) + 
  geom_hextri(linewidth = 0.3, bins = 4) + 
  geom_point(shape = 21, size = 3, color = "black") +
  coord_equal() + 
  theme_classic(base_size = 16) + 
  theme(aspect.ratio = 1) +
  scale_fill_manual("Group", values =  c("#00b38a", "#ea324c")) +
  scale_color_manual("Group", values =  c("#00b38a", "#ea324c"))
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

请注意,如果我们愿意,可以轻松更改垃圾箱的大小和美观。为了在三角形周围获得实心六边形,我们只需添加一个geom_hex图层:

ggplot(df, aes(x, y, fill = factor(group))) + 
  geom_hextri(color = "black", linewidth = 0.1, bins = 5) + 
  geom_point(shape = 21, size = 3) +
  geom_hex(fill = NA, color = "black", linewidth = 1, bins = 5) +
  coord_equal() + 
  theme_classic(base_size = 16) + 
  theme(aspect.ratio = 1) +
  scale_fill_manual("Group", values = c("gray", "red"))
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

应用于另一个数据集我们得到:

ggplot(df, aes(x, y, fill = factor(group))) + 
  geom_hextri(color = "black", linewidth = 0.1, bins = 5) + 
  geom_point(shape = 21, size = 3) +
  geom_hex(fill = NA, color = "black", linewidth = 1, bins = 5) +
  coord_equal() + 
  theme_classic(base_size = 16) + 
  theme(aspect.ratio = 1) +
  scale_fill_manual("Group", values = c("gray", "red"))
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

另请注意,我们不需要使用填充美学。例如,我们可以简单地更改轮廓颜色:

ggplot(iris, aes(Sepal.Width, Sepal.Length, fill = Species)) + 
  geom_hextri(color = "white", linewidth = 0.1, bins = 5) + 
  geom_point(shape = 21, size = 3, position = position_jitter(0.03, 0.03),
             color = "white") +
  geom_hex(fill = NA, colour = NA, linewidth = 1, bins = 5) +
  coord_equal() + 
  theme_minimal(base_size = 20) + 
  theme(aspect.ratio = 1) +
  scale_fill_brewer(palette = "Set2")
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述


代码为geom_hextri

现在困难的部分是实施geom_hextri。我试图将其分解为几块,但代码必然很长,而且很难详细解释。我还必须牺牲一点间距,以使其适合不需要滚动的代码框。


最终,ggplot 必须在绘图设备上将对象绘制为图形对象 ( grobs),但是没有现成的现成工具grob可以绘制这些六边形线段,因此我们需要定义一个函数,在grid::polygonGrob给定适当的 x 的情况下使用 来绘制它们、y 坐标、高度、宽度、图形参数以及我们正在处理的段。这需要接受矢量化数据才能与 ggplot 一起使用:

ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species)) + 
  geom_hextri(fill = NA, linewidth = 2, bins = 5, alpha = 1) + 
  geom_hex(fill = NA, colour = NA, linewidth = 1, bins = 5) +
  coord_equal() + 
  theme_minimal(base_size = 20) + 
  theme(aspect.ratio = 1) +
  scale_colour_brewer(palette = "Set1")
Run Code Online (Sandbox Code Playgroud)

但这本身还不够。我们还需要定义一个geom继承自 的GeomHex,但有它自己的compute_group方法来适当地调用我们的hextriGrob函数。其工作的一部分是确保美学被正确地分割成多个部分,由于技术原因,这不能在一个Stat层内轻松完成。

hextriGrob <- function(x, y, seg, height, width, gp = grid::gpar()) {

  gp <- lapply(seq_along(x), function(i) structure(gp[i], class = "gpar"))
  xl  <- x - width
  xr  <- x + width
  y1  <- y + 2 * height
  y2  <- y + height
  y3  <- y - height
  y4  <- y - 2 * height
  pg  <- grid::polygonGrob
  
  do.call(grid::gList, 
    Map(function(x, y, xl, xr, y1, y2, y3, y4, seg, gp) {
      if(seg == 1) return(pg(x = c(x, x, xr, x),  y = c(y, y1, y2, y), gp = gp))
      if(seg == 2) return(pg(x = c(x, xr, xr, x), y = c(y, y2, y3, y), gp = gp))
      if(seg == 3) return(pg(x = c(x, xr, x, x),  y = c(y, y3, y4, y), gp = gp))
      if(seg == 4) return(pg(x = c(x, x, xl, x),  y = c(y, y4, y3, y), gp = gp))
      if(seg == 5) return(pg(x = c(x, xl, xl, x), y = c(y, y3, y2, y), gp = gp))
      if(seg == 6) return(pg(x = c(x, xl, x, x),  y = c(y, y2, y1, y), gp = gp))
  }, x = x, y = y, xl = xl, xr = xr, y1 = y1, 
     y2 = y2, y3 = y3, y4 = y4, seg = seg, gp = gp))
}
Run Code Online (Sandbox Code Playgroud)

在我们的数据到达这个几何图形之前,需要将其装箱为六边形。不幸的是,现有的StatBinhex无法做到这一点,同时保留我们需要的各个段级美学细节,因此我们必须编写自己的分箱函数:

GeomHextri <- ggproto("GeomHextri", GeomHex,
  draw_group = function (self, data, panel_params, coord, lineend = "butt",
                         linejoin = "mitre", linemitre = 10) {
    table_six <- function(vec) {
      if(!is.factor(vec)) vec <- factor(vec)
      tab <- round(6 * table(vec, useNA = "always")/length(vec))
      n <- diff(c(0, findInterval(cumsum(tab) / sum(tab), 1:6/6)))
      rep(names(tab), times = n)
    }
    num_cols <- sapply(data, is.numeric)
    non_num_cols <- names(data)[!num_cols]
    num_cols <- names(data)[num_cols]
    datasplit <- split(data, interaction(data$x, data$y, drop = TRUE))
    data <- do.call("rbind", lapply(seq_along(datasplit), function(i) {
      num_list <- lapply(datasplit[[i]][num_cols], function(x) rep(mean(x), 6))
      non_num_list <- lapply(datasplit[[i]][non_num_cols], function(x) {
        table_six(rep(x, times = datasplit[[i]]$count))})
      d <- datasplit[[i]][rep(1, 6),]
      d[num_cols] <- num_list
      d[non_num_cols] <- non_num_list
      d$tri <- 1:6
      d$group <- i
      d}))
    data <- ggplot2:::check_linewidth(data, snake_class(self))
    if (ggplot2:::empty(data))  return(zeroGrob())
    coords <- coord$transform(data, panel_params)
    hw <- c(min(diff(unique(sort(coords$x)))), 
            min(diff(unique(sort(coords$y))))/3)
    hextriGrob(coords$x, coords$y, data$tri, hw[2], hw[1],
      gp = grid::gpar(col = data$colour, fill = alpha(data$fill, data$alpha),
                      lwd = data$linewidth * .pt, lty = data$linetype,
                      lineend = lineend, linejoin = linejoin,
                      linemitre = linemitre))})
Run Code Online (Sandbox Code Playgroud)

然后必须在自定义内部使用它Stat

hexify <- function (x, y, z, xbnds, ybnds, xbins, ybins, binwidth,
                    fun = mean, fun.args = list(),
                    drop = TRUE) {

  hb <- hexbin::hexbin(x, xbnds = xbnds, xbins = xbins, y,
                       ybnds = ybnds, shape = ybins/xbins, IDs = TRUE)
  value <- rlang::inject(tapply(z, hb@cID, fun, !!!fun.args))
  out <- hexbin::hcell2xy(hb)
  out <- ggplot2:::data_frame0(!!!out)
  out$value <- as.vector(value)
  out$width <- binwidth[1]
  out$height <- binwidth[2]
  if (drop) out <- stats::na.omit(out)
  out
}
Run Code Online (Sandbox Code Playgroud)

最后,我们需要编写一个 geom 函数,以便我们可以在 ggplot 调用中轻松调用上述所有内容:

StatHextri <- ggproto("StatBinhex", StatBinhex,
  default_aes = aes(weight = 1, alpha = after_stat(count)),
  compute_panel = function (self, data, scales, ...) {
    if (ggplot2:::empty(data)) return(ggplot2:::data_frame0())
    data$group <- 1
    self$compute_group(data = data, scales = scales, ...)},
  compute_group = function (data, scales, binwidth = NULL, bins = 30,
                            na.rm = FALSE){
    `%||%` <- rlang::`%||%`
    rlang::check_installed("hexbin", reason = "for `stat_binhex()`")
    binwidth <- binwidth %||% ggplot2:::hex_binwidth(bins, scales)
    if (length(binwidth) == 1) binwidth <- rep(binwidth, 2)
    wt <- data$weight %||% rep(1L, nrow(data))
    non_pos <- !names(data) %in% c("x", "y", "PANEL", "group")
    is_num  <- sapply(data, is.numeric)
    aes_vars <- names(data)[non_pos & !is_num]
    grps <- do.call("interaction", c(as.list(data[aes_vars]), drop = TRUE))
    xbnds <- ggplot2:::hex_bounds(data$x, binwidth[1])
    xbins <- diff(xbnds)/binwidth[1]
    ybnds <- ggplot2:::hex_bounds(data$y, binwidth[2])
    ybins <- diff(ybnds)/binwidth[2]
    do.call("rbind", Map(function(data, wt) {
      out <- hexify(data$x, data$y, wt, xbnds, ybnds, xbins,
                    ybins, binwidth, sum)
      for(var in aes_vars) out[[var]] <- data[[var]][1]
      out$density <- as.vector(out$value/sum(out$value, na.rm = TRUE))
      out$ndensity <- out$density/max(out$density, na.rm = TRUE)
      out$count <- out$value
      out$ncount <- out$count/max(out$count, na.rm = TRUE)
      out$value <- NULL
      out$group <- 1
      out}, split(data, grps), split(wt, grps)))})
Run Code Online (Sandbox Code Playgroud)