emp*_*rum 1 r ggplot2 geom-text geom-tile
我geom_tile()与geom_textfrom一起使用ggplot2基本上生成一个表:
两个因素(X 和 Y)绘制在 x 轴上(产生两列图块)。因子 Y 的水平嵌套在因子 X 内。因此,绘制了因子 X 的多个相同图块(每个因子 Y 水平一个图块)。有没有办法将因子 X 的图块“合并”为更大的图块,并且每个图块仅出现一次文本?我也愿意接受使用其他功能来geom_tile()实现这一目标的方法。
结果应该是这样的:
这是我的代码:
library(ggplot2)
dat <- data.frame(id = c(1:4, 1:4),
factor = c(rep("X", times = 4), rep("Y", times = 4)),
value = c("A", "A", "B", "B", "C", "D", "E", "F"))
ggplot(dat, aes(y = id, x = factor)) +
geom_tile(color = "black", fill = NA) +
geom_text(aes(label = value))
Run Code Online (Sandbox Code Playgroud)
y = id必须保存。
在我的实际数据中,id 是一个因素:
dat <- data.frame(id = factor(c("n1", "n2", "n3", "n4", "n1", "n2", "n3", "n4")),
factor = c(rep("X", times = 4), rep("Y", times = 4)),
value = c("A", "A", "B", "B", "C", "D", "E", "F"))
Run Code Online (Sandbox Code Playgroud)
使用 Allan Cameron 提供的代码生成每个因子级别多个“合并”单元格问题的 reprex:
dat <- data.frame(id = factor(c("n1", "n2", "n3", "n4", "n1", "n2", "n3", "n4")),
factor = rep(c('X', 'Y'), each = 4),
value = c('A', 'C', 'B', 'C', 'D', 'E', 'F', 'G'))
dat %>%
mutate(id = as.numeric(factor(id))) %>%
group_by(factor) %>%
mutate(chunk = data.table::rleid(value)) %>%
group_by(factor, chunk, value) %>%
summarise(y = n()) %>%
group_by(factor) %>%
mutate(height = y) %>%
mutate(y = (cumsum(y) + cumsum(lag(y, 1, 0)))/2 + 0.5) %>%
ggplot(aes(y = y, x = factor)) +
geom_tile(aes(height = height), color = "black", fill = NA) +
geom_text(aes(label = value)) +
scale_y_continuous('id', breaks = seq_along(unique(dat$id)),
labels = levels(factor(dat$id)))
Run Code Online (Sandbox Code Playgroud)
输出:
应针对每个因素合并包含“C”的行。
我的真实数据的子集:
dat <- structure(list(id = structure(c(3L, 4L, 5L, 6L, 1L, 7L, 2L, 8L,
3L, 4L, 5L, 6L, 1L, 7L, 2L, 8L,
3L, 4L, 5L, 6L, 1L, 7L, 2L, 8L),
levels = c("n374", "n673", "n139", "n2015",
"n344", "n36", "n467", "n76"),
class = "factor"),
factor = structure(c(1L,1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L,3L, 3L, 3L, 3L, 3L, 3L),
levels = c("phylum", "class", "genus"),
class = "factor"),
value = c("Proteobacteria", "Proteobacteria",
"Proteobacteria", "Proteobacteria",
"Bacteroidetes", "Proteobacteria",
"Bacteroidetes", "Proteobacteria",
"Alphaproteobacteria", "Betaproteobacteria",
"Alphaproteobacteria", "Alphaproteobacteria",
"Cytophagia", "Betaproteobacteria",
"Chitinophagia", "Betaproteobacteria",
"Sphingomonas", "Aquabacterium",
"Dongia", "Sphingomonas", "Chryseolinea",
"unidentified", "unidentified","Sphaerotilus")),
row.names = c(NA, -24L),
class = c("tbl_df", "tbl", "data.frame"))
Run Code Online (Sandbox Code Playgroud)
这会产生以下输出,其中包含艾伦·卡梅伦编辑的代码:
[![在此处输入图像描述][2]][2]
我想要的是:
Allan Cameron 的最新代码仍然会导致“较低级别”图块在未嵌套在同一“较高级别”图块中时进行合并。这会影响“未识别”值:
这可以用以下数据重现:
dat <- structure(list(id = structure(c(3L, 4L, 5L, 1L, 6L, 2L,
3L, 4L, 5L, 1L, 6L, 2L,
3L, 4L, 5L, 1L, 6L, 2L),
levels = c("OTU_374", "OTU_673", "OTU_139",
"OTU_344", "OTU_36", "OTU_467"),
class = "factor"),
factor = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
levels = c("phylum", "class", "genus"),
class = "factor"),
value = c("Proteobacteria", "Proteobacteria",
"Proteobacteria", "Bacteroidetes",
"Proteobacteria", "Bacteroidetes",
"Alphaproteobacteria", "Alphaproteobacteria",
"Alphaproteobacteria", "Cytophagia",
"Betaproteobacteria", "Chitinophagia",
"Sphingomonas", "unidentified", "Sphingomonas",
"Chryseolinea", "unidentified", "unidentified")),
row.names = c(NA, -18L),
class = c("tbl_df", "tbl", "data.frame"))
Run Code Online (Sandbox Code Playgroud)
如果“较低级别”图块已经被“较高级别”图块分隔开,则它们不应合并。
你可以使用geom_col:
ggplot(dplyr::count(dat, value, factor), aes(y = n, x = factor)) +
geom_col(color = "black", fill = NA, position = 'stack', width = 1) +
geom_text(aes(label = value), position = position_stack(vjust = 0.5))
Run Code Online (Sandbox Code Playgroud)
但更通用的解决方案geom_tile是计算每个瓷砖的中心点和高度,将后者映射到height美学上。
library(tidyverse)
dat %>%
mutate(id = as.numeric(factor(id))) %>%
group_by(factor) %>%
mutate(chunk = data.table::rleid(value)) %>%
group_by(factor, chunk, value) %>%
summarise(y = n()) %>%
group_by(factor) %>%
mutate(height = y) %>%
mutate(y = (cumsum(y) + cumsum(lag(y, 1, 0)))/2 + 0.5) %>%
ggplot(aes(y = y, x = factor)) +
geom_tile(aes(height = height), color = "black", fill = NA) +
geom_text(aes(label = value)) +
scale_y_continuous('id', breaks = seq_along(unique(dat$id)),
labels = levels(factor(dat$id)))
Run Code Online (Sandbox Code Playgroud)
这还允许合并非连续的块,例如,如果您的数据是
dat <- data.frame(id = c(1:7, 1:7),
factor = rep(c('X', 'y'), each = 7),
value = c('A', 'A', 'B', 'B', 'A', 'B', 'B',
'C', 'D', 'E', 'F', 'B', 'B', 'B'))
Run Code Online (Sandbox Code Playgroud)
然后你会得到
这样 ID 始终与正确的值匹配,并且 ID 排序优先于单元格合并。
编辑
有了一些现在可用的实际数据,以及来自 OP 的新信息,我们可以做:
dat2 <- dat %>%
mutate(factor = paste0(factor, '_value')) %>%
pivot_wider(names_from = factor, values_from = value) %>%
arrange(phylum_value, class_value, genus_value) %>%
mutate(id = factor(id, id)) %>%
group_by(phylum_value) %>%
mutate(phylum_chunk = cur_group_id()) %>%
group_by(phylum_value, class_value) %>%
mutate(class_chunk = cur_group_id()) %>%
group_by(phylum_value, class_value, genus_value) %>%
mutate(genus_chunk = cur_group_id()) %>%
pivot_longer(phylum_value:genus_chunk, names_sep = '_',
names_to = c('factor', '.value'))
dat2 %>%
group_by(factor, chunk, value) %>%
summarise(y = n()) %>%
mutate(factor = factor(factor, c('phylum', 'class', 'genus'))) %>%
group_by(factor) %>%
mutate(height = y) %>%
mutate(y = (cumsum(y) + cumsum(lag(y, 1, 0)))/2 + 0.5) %>%
ggplot(aes(y = y, x = factor)) +
geom_tile(aes(height = height), color = "black", fill = NA) +
geom_text(aes(label = value)) +
scale_y_continuous('id', breaks = seq_along(levels(dat2$id)),
labels = levels(dat2$id))
Run Code Online (Sandbox Code Playgroud)