Binning continuous data while keeping the bins in the correct order: can ggplot2 arrange the axis labels automatically?

sta*_*oob 1 r ggplot2

I simulated a dataset, did some data manipulation (in a rather clumsy way), and produced the plot below.

Simulating the data:

# Step 1 : Simulate Data

set.seed(123)
Hospital_Visits = sample.int(20,  5000, replace = TRUE)
Weight = rnorm(5000, 90, 10)

disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)

my_data = data.frame(Weight, Hospital_Visits, Disease)

my_data$hospital_ntile <- cut(my_data$Hospital_Visits, breaks = c(0, 5, 10, Inf), labels = c("Less than 5", "5 to 10", "More than 10"), include.lowest = TRUE)

Data manipulation:

# Step 2: Data Manipulation:

my_data$weight_ntile <- cut(my_data$Weight, breaks = seq(min(my_data$Weight), max(my_data$Weight), by = (max(my_data$Weight) - min(my_data$Weight)) / 10), include.lowest = TRUE)


# Create a dataset for rows where hospital_ntile = 'Less than 5'
df1 <- subset(my_data, hospital_ntile == "Less than 5")

# Create a dataset for rows where hospital_ntile = '5 to 10'
df2 <- subset(my_data, hospital_ntile == "5 to 10")

# Create a dataset for rows where hospital_ntile = 'More than 10'
df3 <- subset(my_data, hospital_ntile == "More than 10")

avg_disease_rate_df1 <- tapply(df1$Disease == "Yes", df1$weight_ntile, mean)
avg_disease_rate_df2 <- tapply(df2$Disease == "Yes", df2$weight_ntile, mean)
avg_disease_rate_df3 <- tapply(df3$Disease == "Yes", df3$weight_ntile, mean)

avg_disease_rate_df1[is.na(avg_disease_rate_df1)] <- 0
avg_disease_rate_df2[is.na(avg_disease_rate_df2)] <- 0
avg_disease_rate_df3[is.na(avg_disease_rate_df3)] <- 0

#transform into dataset

names = names(avg_disease_rate_df1)
rate_1 = as.numeric(avg_disease_rate_df1)
rate_2 = as.numeric(avg_disease_rate_df2)
rate_3 = as.numeric(avg_disease_rate_df3)

# stack data
d1 = data.frame(class = "Less than 5", names = names, rate = rate_1)
d2 = data.frame(class = "5 to 10", names = names, rate = rate_2)
d3 = data.frame(class = "More than 10", names = names, rate = rate_3)

plot_data = rbind(d1, d2, d3)

Making the plot:

library(ggplot2)
ggplot(plot_data, aes(x=names, y=rate, group = class,  color=class)) + geom_point() + geom_line() +  theme_bw()

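For some reason the bins on the x-axis are out of order: they appear in what looks like a random sequence, and I would like them arranged from smallest to largest.

I have looked at some references that show how to change this manually, but is there an option in ggplot2 that corrects the order automatically?

Thanks!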

cb1*_*b14 5

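Edit: you can do this either 1) at the plotting step or 2) at the data manipulation step.

1) Doing it at the plotting step

I think the simplest way is to convert the x-axis variable to a factor and put its levels in the order you want.

Right now it is a character column: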

str(plot_data)
'data.frame':   30 obs. of  3 variables:
 $ class: chr  "Less than 5" "Less than 5" "Less than 5" "Less than 5" ...
 $ names: chr  "[52.6,59.9]" "(59.9,67.2]" "(67.2,74.5]" "(74.5,81.8]" ...
 $ rate : num  0.6 0.1 0.339 0.399 0.438 ...

So you can turn it into a factor and then check the levels:

plot_data$names <- as.factor(plot_data$names)
levels(plot_data$names)

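as.factor() sorts the labels as text rather than by their numeric bounds, so the levels come out in what looks like a random order: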

[1] "(104,111]"   "(111,118]"   "(118,125]"   "(59.9,67.2]" "(67.2,74.5]" "(74.5,81.8]" "(81.8,89.1]" "(89.1,96.3]" "(96.3,104]" 
[10] "[52.6,59.9]"
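To see why the levels come out in that order, it helps to compare how R sorts numbers with how it sorts their text form. The values below are made up for illustration and are not taken from plot_data:

```r
# Hypothetical bin boundaries, used only to contrast numeric and text sorting
x_num <- c(59.9, 104, 67.2)
x_chr <- as.character(x_num)

# Numeric sort compares values: 59.9 < 67.2 < 104
num_sorted <- sort(x_num)

# Text sort compares character by character: "1" comes before "5" and "6",
# so "104" lands first even though 104 is the largest number
chr_sorted <- sort(x_chr)

# as.factor() on a character vector builds its levels from that text sort
chr_levels <- levels(as.factor(x_chr))

print(num_sorted)
print(chr_sorted)
print(chr_levels)
```

The same thing happens to the bin labels: "(104,111]" sorts ahead of "(59.9,67.2]" because the comparison is done on characters, not on the numbers they spell out.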

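You can then reorder them with the forcats package (there are other options, but I like this one):

library(forcats)

# Move the listed levels to the front, in this order; the one level not
# listed here, "(118,125]", stays after them
plot_data$names <- fct_relevel(plot_data$names,
                               c("[52.6,59.9]", "(59.9,67.2]", "(67.2,74.5]", 
                                 "(74.5,81.8]", "(81.8,89.1]", "(89.1,96.3]", 
                                 "(96.3,104]", "(104,111]", "(111,118]" ))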
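If typing the levels out by hand feels fragile, one alternative (a sketch, not part of the approach above) is to let the order of first appearance define the levels. This assumes the labels already appear in bin order within the data frame, which holds for plot_data here because names() returned them in the order cut() created them. The toy data frame below is a hypothetical stand-in for plot_data:

```r
# Hypothetical stand-in for plot_data: the same labels repeat once per class,
# and within each class they are already in bin order
labels <- c("[52.6,59.9]", "(59.9,67.2]", "(96.3,104]", "(104,111]")
toy <- data.frame(
  class = rep(c("Less than 5", "5 to 10"), each = length(labels)),
  names = rep(labels, times = 2),
  rate  = (1:8) / 10
)

# unique() keeps values in order of first appearance, so using it for
# `levels` preserves the bin order instead of sorting the labels as text
toy$names <- factor(toy$names, levels = unique(toy$names))

print(levels(toy$names))
```

For the real data this would be plot_data$names <- factor(plot_data$names, levels = unique(plot_data$names)). forcats::fct_inorder() does the same thing if you prefer to stay within forcats.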

Your plot will then look like this:

ggplot(plot_data, aes(x=names, y=rate, group = class,  color=class)) + 
  geom_point() + 
  geom_line() +  
  theme_bw()

2) Doing it at the data manipulation step

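When you create weight_ntile with cut(), you can ask for an ordered result (ordered_result = TRUE), which gives you an ordered factor. However, if the rest of the data manipulation stays as it is, that ordering is lost, because the bin labels pass through names() and come out as plain character strings. Instead, you can do all of it in a single dplyr and tidyr pipeline that keeps weight_ntile as a factor from start to finish.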
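A minimal sketch of such a pipeline follows. It is an illustration under assumptions, not necessarily the exact code intended above: it regenerates data similar to the question's simulation, uses cut(Weight, breaks = 10) as a shortcut for ten equal-width bins (so the bin labels differ slightly from the original), and uses tidyr::complete() to give empty bin combinations a rate of 0, mirroring the NA replacement in the question.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated data along the lines of the question (assumed, not the exact draws)
set.seed(123)
my_data <- data.frame(
  Weight = rnorm(5000, 90, 10),
  Hospital_Visits = sample.int(20, 5000, replace = TRUE),
  Disease = factor(sample(c("Yes", "No"), 5000, replace = TRUE,
                          prob = c(0.4, 0.6)))
)

plot_data <- my_data %>%
  mutate(
    hospital_ntile = cut(Hospital_Visits, breaks = c(0, 5, 10, Inf),
                         labels = c("Less than 5", "5 to 10", "More than 10"),
                         include.lowest = TRUE),
    # breaks = 10 asks cut() for ten equal-width bins covering the full range;
    # ordered_result = TRUE returns an ordered factor
    weight_ntile = cut(Weight, breaks = 10, ordered_result = TRUE)
  ) %>%
  group_by(hospital_ntile, weight_ntile) %>%
  summarise(rate = mean(Disease == "Yes"), .groups = "drop") %>%
  # add any bin combination with no observations, with a rate of 0
  complete(hospital_ntile, weight_ntile, fill = list(rate = 0))

# weight_ntile is still a factor, so ggplot2 follows its level order
ggplot(plot_data, aes(x = weight_ntile, y = rate,
                      group = hospital_ntile, color = hospital_ntile)) +
  geom_point() +
  geom_line() +
  theme_bw()
```

Because the bin column never becomes a character vector, no manual releveling is needed before plotting.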


The plot can then be drawn in the same way as before.


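  • Oh, I think I have an answer: you can do it inside the cut function, but the rest of the data manipulation then needs to be restructured. That is how I would do it; I'll edit my answer! (2 upvotes)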