合并相同的图块

emp*_*rum 1 r ggplot2 geom-text geom-tile

geom_tile()geom_textfrom一起使用ggplot2基本上生成一个表:

在此输入图像描述

两个因素(X 和 Y)绘制在 x 轴上(产生两列图块)。因子 Y 的水平嵌套在因子 X 内。因此,绘制了因子 X 的多个相同图块(每个因子 Y 水平一个图块)。有没有办法将因子 X 的图块“合并”为更大的图块,并且每个图块仅出现一次文本?我也愿意接受使用其他功能来geom_tile()实现这一目标的方法。

结果应该是这样的:

在此输入图像描述

这是我的代码:

library(ggplot2)

dat <- data.frame(id = c(1:4, 1:4),
                  factor = c(rep("X", times = 4), rep("Y", times = 4)),
                  value = c("A", "A", "B", "B", "C", "D", "E", "F"))

ggplot(dat, aes(y = id, x = factor)) +
  geom_tile(color = "black", fill = NA) +
  geom_text(aes(label = value))
Run Code Online (Sandbox Code Playgroud)

y = id必须保存。

编辑1

在我的实际数据中,id 是一个因素:

dat <- data.frame(id = factor(c("n1", "n2", "n3", "n4", "n1", "n2", "n3", "n4")),
                  factor = c(rep("X", times = 4), rep("Y", times = 4)),
                  value = c("A", "A", "B", "B", "C", "D", "E", "F"))
Run Code Online (Sandbox Code Playgroud)

编辑2

使用 Allan Cameron 提供的代码生成每个因子级别多个“合并”单元格问题的 reprex:

dat <- data.frame(id = factor(c("n1", "n2", "n3", "n4", "n1", "n2", "n3", "n4")),
                  factor = rep(c('X', 'Y'), each = 4),
                  value  = c('A', 'C', 'B', 'C', 'D', 'E', 'F', 'G'))

dat %>%
  mutate(id = as.numeric(factor(id))) %>%
  group_by(factor) %>%
  mutate(chunk = data.table::rleid(value)) %>%
  group_by(factor, chunk, value) %>%
  summarise(y = n()) %>%
  group_by(factor) %>%
  mutate(height = y) %>%
  mutate(y = (cumsum(y) + cumsum(lag(y, 1, 0)))/2 + 0.5) %>%
  ggplot(aes(y = y, x = factor)) +
  geom_tile(aes(height = height), color = "black", fill = NA) +
  geom_text(aes(label = value)) +
  scale_y_continuous('id', breaks = seq_along(unique(dat$id)),
                     labels = levels(factor(dat$id)))
Run Code Online (Sandbox Code Playgroud)

输出:

在此输入图像描述

应针对每个因素合并包含“C”的行。

编辑3

我的真实数据的子集:

dat <- structure(list(id = structure(c(3L, 4L, 5L, 6L, 1L, 7L, 2L, 8L, 
                                       3L, 4L, 5L, 6L, 1L, 7L, 2L, 8L,
                                       3L, 4L, 5L, 6L, 1L, 7L, 2L, 8L),
                                     levels = c("n374", "n673", "n139", "n2015",
                                                "n344", "n36", "n467", "n76"),
                                     class = "factor"),
                      factor = structure(c(1L,1L, 1L, 1L, 1L, 1L, 1L, 1L,
                                           2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
                                           3L, 3L,3L, 3L, 3L, 3L, 3L, 3L),
                                         levels = c("phylum", "class", "genus"),
                                         class = "factor"),
                      value = c("Proteobacteria", "Proteobacteria",
                                "Proteobacteria", "Proteobacteria",
                                "Bacteroidetes", "Proteobacteria",
                                "Bacteroidetes", "Proteobacteria",
                                "Alphaproteobacteria", "Betaproteobacteria",
                                "Alphaproteobacteria", "Alphaproteobacteria",
                                "Cytophagia", "Betaproteobacteria",
                                "Chitinophagia", "Betaproteobacteria",
                                "Sphingomonas", "Aquabacterium",
                                "Dongia", "Sphingomonas", "Chryseolinea",
                                "unidentified", "unidentified","Sphaerotilus")),
                 row.names = c(NA, -24L),
                 class = c("tbl_df", "tbl", "data.frame"))
Run Code Online (Sandbox Code Playgroud)

这会产生以下输出,其中包含艾伦·卡梅伦编辑的代码:

[![在此处输入图像描述][2]][2]

我想要的是:

  • 对于因子门:变形菌门的两个区域应合并。
  • 对于因子类别:Betaproteabacteria 的两个区域应合并。
  • 对于因子属:鞘氨醇单胞菌的两个区域应合并。
  • 对于因子属:“未识别”的两个区域不应合并,因为它们嵌套在不同级别的因子门和类中。

编辑4

Allan Cameron 的最新代码仍然会导致“较低级别”图块在未嵌套在同一“较高级别”图块中时进行合并。这会影响“未识别”值:

在此输入图像描述

这可以用以下数据重现:

dat <- structure(list(id = structure(c(3L, 4L, 5L, 1L, 6L, 2L,
                                       3L, 4L, 5L, 1L, 6L, 2L, 
                                       3L, 4L, 5L, 1L, 6L, 2L),
                                     levels = c("OTU_374", "OTU_673", "OTU_139",
                                                "OTU_344", "OTU_36", "OTU_467"),
                                     class = "factor"),
                      factor = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
                                           2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
                                         levels = c("phylum", "class", "genus"),
                                         class = "factor"),
                      value = c("Proteobacteria", "Proteobacteria",
                                "Proteobacteria", "Bacteroidetes",
                                "Proteobacteria", "Bacteroidetes",
                                "Alphaproteobacteria", "Alphaproteobacteria",
                                "Alphaproteobacteria", "Cytophagia",
                                "Betaproteobacteria", "Chitinophagia",
                                "Sphingomonas", "unidentified", "Sphingomonas",
                                "Chryseolinea", "unidentified", "unidentified")),
                 row.names = c(NA, -18L),
                 class = c("tbl_df", "tbl", "data.frame"))
Run Code Online (Sandbox Code Playgroud)

如果“较低级别”图块已经被“较高级别”图块分隔开,则它们不应合并。

All*_*ron 9

你可以使用geom_col

ggplot(dplyr::count(dat, value, factor), aes(y = n, x = factor)) +
  geom_col(color = "black", fill = NA, position = 'stack', width = 1) +
  geom_text(aes(label = value), position = position_stack(vjust = 0.5))
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

但更通用的解决方案geom_tile是计算每个瓷砖的中心点和高度,将后者映射到height美学上。

library(tidyverse)

dat %>%
  mutate(id = as.numeric(factor(id))) %>%
  group_by(factor) %>%
  mutate(chunk = data.table::rleid(value)) %>%
  group_by(factor, chunk, value) %>%
  summarise(y = n()) %>%
  group_by(factor) %>%
  mutate(height = y) %>%
  mutate(y = (cumsum(y) + cumsum(lag(y, 1, 0)))/2 + 0.5) %>%
  ggplot(aes(y = y, x = factor)) +
  geom_tile(aes(height = height), color = "black", fill = NA) +
  geom_text(aes(label = value)) +
  scale_y_continuous('id', breaks = seq_along(unique(dat$id)),
                     labels = levels(factor(dat$id)))
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述

这还允许合并非连续的块,例如,如果您的数据是

dat <- data.frame(id = c(1:7, 1:7),
                  factor = rep(c('X', 'y'), each = 7),
                  value  = c('A', 'A', 'B', 'B', 'A', 'B', 'B',
                             'C', 'D', 'E', 'F', 'B', 'B', 'B'))
Run Code Online (Sandbox Code Playgroud)

然后你会得到

在此输入图像描述

这样 ID 始终与正确的值匹配,并且 ID 排序优先于单元格合并。


编辑

有了一些现在可用的实际数据,以及来自 OP 的新信息,我们可以做:

dat2 <- dat %>%
  mutate(factor = paste0(factor, '_value')) %>%
  pivot_wider(names_from = factor, values_from = value) %>% 
  arrange(phylum_value, class_value, genus_value) %>%
  mutate(id = factor(id, id)) %>%
  group_by(phylum_value) %>%
  mutate(phylum_chunk = cur_group_id()) %>%
  group_by(phylum_value, class_value) %>%
  mutate(class_chunk = cur_group_id()) %>%
  group_by(phylum_value, class_value, genus_value) %>% 
  mutate(genus_chunk = cur_group_id()) %>%
  pivot_longer(phylum_value:genus_chunk, names_sep = '_', 
               names_to = c('factor', '.value'))

dat2 %>%
  group_by(factor, chunk, value) %>%
  summarise(y = n()) %>%
  mutate(factor = factor(factor, c('phylum', 'class', 'genus'))) %>%
  group_by(factor) %>%
  mutate(height = y) %>%
  mutate(y = (cumsum(y) + cumsum(lag(y, 1, 0)))/2 + 0.5) %>%
  ggplot(aes(y = y, x = factor)) +
  geom_tile(aes(height = height), color = "black", fill = NA) +
  geom_text(aes(label = value)) +
  scale_y_continuous('id', breaks = seq_along(levels(dat2$id)),
                     labels = levels(dat2$id))
Run Code Online (Sandbox Code Playgroud)

在此输入图像描述