A consistent way to overlay data on a histogram (extracting the binned data from geom_histogram?)

byt*_*101 8 r histogram ggplot2

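My goal is to create this plot in ggplot2:

[Figure: horizontal histogram with the data entries filling the bars]

After a lot of fiddling, I managed to create it for this one dataset, as shown in the screenshot above, with the following rather fragile code (note width=63, boundary=410, which took a lot of trial and error):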

library(ggplot2) # ggplot, cut_width, cut_number
library(dplyr)   # mutate, count
ex = data.frame(year=c(1971,1973,1975,1977,1979,1981,1983,1985,1987,1989,1991,1993,1995,1997,1999,2001,2003,2005,2007,2009,2011,2013,2015,2017,2019), occurances=c(347,773,589,462,280,455,1037,707,663,746,531,735,751,666,642,457,411,286,496,467,582,577,756,557,373))
ex_bin = mutate(ex, range=cut_width(occurances, width=63, boundary=410)) # bin the data
ex_bin$lower = as.numeric(sub("[\\(\\[](.+),.*", "\\1", ex_bin$range)) # extract range lower bound
ex_bin$upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", ex_bin$range)) # extract range upper bound
ex_bin$pos = as.numeric(rbind(by(ex_bin, seq_len(nrow(ex_bin)), function(ey) count(ex_bin[ex_bin$year <= ey$year & ex_bin$upper == ey$upper, ])))[1,]) # extract our visual x position, based on the number of years already in this bin
ggplot(ex_bin, aes(x=occurances, fill=year==2019)) +coord_flip() + geom_histogram(binwidth = 63, boundary=410) + geom_text(color="white", aes(label=year, x=(upper+lower)/2, y=pos-0.5, group=year), ex_bin) # plot!

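Note the hard-coded boundary and binwidth. This is very fragile and has to be tuned for every dataset. How can I make it work consistently? I care less about highlighting a selected year (2019 here, only to show the misalignment across bins) than about correct label placement. My earlier attempts with stat_bin, cut_number, bins=13 and other approaches all ended in misaligned plots like the one below (I have switched from text to label to show the alignment error more clearly):

[Figure: horizontal histogram with misaligned labels]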

ex_bin = mutate(ex, range=cut_number(occurances, n=13)) # I've also tried cut_interval
ex_bin$lower = as.numeric(sub("[\\(\\[](.+),.*", "\\1", ex_bin$range))
ex_bin$upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", ex_bin$range))
ex_bin$pos = as.numeric(rbind(by(ex_bin, seq_len(nrow(ex_bin)), function(ey) count(ex_bin[ex_bin$year <= ey$year & ex_bin$upper == ey$upper, ])))[1,])
ggplot(ex_bin, aes(x=occurances, fill=year==2019)) +coord_flip() + geom_histogram(bins=13) + geom_label(color="white", aes(label=year, x=(upper+lower)/2, y=pos-0.5, group=year), ex_bin)

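Why? Is there any way to extract and use the same data that geom_histogram uses? I tried reading the ggplot code but could not follow the execution flow. To add to the confusion, changing the label placement code often re-bins the geom_histogram as well, even though it uses the original data frame. This surprised me, because every tweak to the labels throws off the positions as the histogram shifts again (note the three years in the bin below the highlighted bin, rather than the two above it):

[Figure: horizontal histogram with misaligned labels on different bins]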

ex_bin = mutate(ex, range=cut_width(occurances, width=63, boundary=410))
ex_bin$lower = as.numeric(sub("[\\(\\[](.+),.*", "\\1", ex_bin$range))
ex_bin$upper = as.numeric(sub("[^,]*,([^]]*)\\]", "\\1", ex_bin$range))
ex_bin$pos = as.numeric(rbind(by(ex_bin, seq_len(nrow(ex_bin)), function(ey) count(ex_bin[ex_bin$year <= ey$year & ex_bin$upper == ey$upper, ])))[1,])
ggplot(ex_bin, aes(x=occurances, fill=year==2019)) +coord_flip() + geom_histogram(bins=13) + geom_label(color="white", aes(label=year, x=(upper+lower)/2, y=pos-0.5, group=year), ex_bin)

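So my questions are:

  1. How can I make this plot type work consistently by specifying bins=13 or similar? Is there a simpler way to do this?
  2. Why is geom_histogram so slippery, re-binning based on seemingly "unrelated" code?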

ste*_*fan 5

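One option to achieve your desired result is to use stat="bin" for geom_text as well. Additionally, we have to group by year so that each year becomes a separate "block". The tricky part is getting the year labels, for which I use after_stat. However, as the groups are stored internally as an integer sequence, we have to map them back to the corresponding years, for which I use a helper vector.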

library(ggplot2)
library(dplyr)

ex <- data.frame(year = c(1971, 1973, 1975, 1977, 1979, 1981, 1983, 1985, 1987, 1989, 1991, 1993, 1995, 1997, 1999, 2001, 2003, 2005, 2007, 2009, 2011, 2013, 2015, 2017, 2019), 
                occurances = c(347, 773, 589, 462, 280, 455, 1037, 707, 663, 746, 531, 735, 751, 666, 642, 457, 411, 286, 496, 467, 582, 577, 756, 557, 373))

years <- levels(factor(ex$year))
                
ggplot(ex, aes(y = occurances, fill = year == 2019, group = as.character(year), label = year)) +
  geom_histogram(binwidth = 63, boundary = 410, position = position_stack(reverse = TRUE)) +
  geom_text(color = "white", aes(label = after_stat(if_else(count > 0, as.character(years[group]), ""))), stat = "bin", 
            binwidth = 63, boundary = 410, position = position_stack(vjust = .5, reverse = TRUE))
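
The `years[group]` lookup above relies on ggplot2 numbering the groups in sorted order of the grouping values. The following is a minimal sketch for checking that on a small, deliberately unsorted subset of the data; the names `small`, `p`, `built`, `hits` and `recovered` are illustrative and not part of the answer. It assumes a ggplot2 version that accepts a `y` aesthetic in `geom_histogram` (3.3.0 or later). Note that the mapping works here because all years have four digits, so sorting them as character strings gives the same order as sorting them numerically.

```r
library(ggplot2)

# Three rows from the question's data, deliberately out of year order
small <- data.frame(
  year = c(1975, 1971, 1973),
  occurances = c(589, 347, 773)
)

# Helper vector as in the answer: sorted years as character strings
years <- levels(factor(small$year))

p <- ggplot(small, aes(y = occurances, group = as.character(year))) +
  geom_histogram(bins = 5)

# layer_data() returns the data frame the layer actually draws,
# including the integer group id ggplot2 assigned to each year
built <- layer_data(p, 1)
hits <- built[built$count > 0, ]

# Map the integer group ids back to years via the helper vector
recovered <- years[hits$group]
print(data.frame(group = hits$group, year = recovered,
                 ymin = hits$ymin, ymax = hits$ymax))
```

Each recovered year should sit in the bin that contains its `occurances` value, which is what lets `after_stat()` produce the correct label.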

EDIT: The approach also works fine when using bins instead of binwidth and boundary.

ggplot(ex, aes(y = occurances, fill = year == 2019, group = as.character(year), label = year)) +
  geom_histogram(bins=13, position = position_stack(reverse = TRUE)) +
  geom_text(color = "white", aes(label = after_stat(if_else(count > 0, as.character(years[group]), ""))), stat = "bin", 
            bins=13, position = position_stack(vjust = .5, reverse = TRUE))
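
On the question of extracting the bins that geom_histogram computed: ggplot2 exports `layer_data()`, which returns the data a layer draws after its stat has run. The sketch below is one hedged way to reuse those bins for manual label placement; it is not part of the answer above, and the names `bins`, `breaks`, `bin` and `pos` are illustrative. It assumes the default `closed = "right"` binning, so intervals are treated as open on the left. It also suggests why the question's third snippet misaligns: the labels there are binned with `cut_width(width = 63, boundary = 410)` while the histogram uses `bins = 13`, so the two binning schemes differ.

```r
library(ggplot2)

ex <- data.frame(
  year = seq(1971, 2019, by = 2),
  occurances = c(347, 773, 589, 462, 280, 455, 1037, 707, 663, 746, 531, 735, 751,
                 666, 642, 457, 411, 286, 496, 467, 582, 577, 756, 557, 373)
)

p <- ggplot(ex, aes(x = occurances)) + geom_histogram(bins = 13)

# Bin edges and counts exactly as the histogram layer computed them
bins <- layer_data(p, 1)[, c("xmin", "xmax", "count")]

# Assign every observation to one of those bins (intervals are (xmin, xmax])
breaks <- c(bins$xmin, bins$xmax[nrow(bins)])
ex$bin <- findInterval(ex$occurances, breaks,
                       left.open = TRUE, rightmost.closed = TRUE)

# Position inside the bin, ordered by year, as in the question's `pos` column
ex$pos <- ave(ex$year, ex$bin, FUN = rank)

print(head(ex[order(ex$bin, ex$pos), ]))
```

Because `bin` and `pos` are derived from the layer's own output, they stay in step with the histogram as long as both are rebuilt from the same `bins`, `binwidth` or `boundary` arguments.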