I simulated a dataset, did some data manipulation (in a very clumsy way), and produced the plot below.
Simulating the data:
# Step 1 : Simulate Data
set.seed(123)
Hospital_Visits = sample.int(20, 5000, replace = TRUE)
Weight = rnorm(5000, 90, 10)
disease <- c("Yes","No")
disease <- sample(disease, 5000, replace=TRUE, prob=c(0.4, 0.6))
Disease <- as.factor(disease)
my_data = data.frame(Weight, Hospital_Visits, Disease)
my_data$hospital_ntile <- cut(my_data$Hospital_Visits, breaks = c(0, 5, 10, Inf), labels = c("Less than 5", "5 to 10", "More than 10"), include.lowest = TRUE)
Data manipulation:
# Step 2: Data Manipulation:
my_data$weight_ntile <- cut(my_data$Weight, breaks = seq(min(my_data$Weight), max(my_data$Weight), by = (max(my_data$Weight) - min(my_data$Weight)) / 10), include.lowest = TRUE)
# Create a dataset for rows where hospital_ntile = 'Less than 5'
df1 <- subset(my_data, hospital_ntile == "Less than 5")
# Create a dataset for rows where hospital_ntile = '5 to 10'
df2 <- subset(my_data, hospital_ntile == "5 to 10")
# Create a dataset for rows where hospital_ntile = 'More than 10'
df3 <- subset(my_data, hospital_ntile == "More than 10")
avg_disease_rate_df1 <- tapply(df1$Disease == "Yes", df1$weight_ntile, mean)
avg_disease_rate_df2 <- tapply(df2$Disease == "Yes", df2$weight_ntile, mean)
avg_disease_rate_df3 <- tapply(df3$Disease == "Yes", df3$weight_ntile, mean)
avg_disease_rate_df1[is.na(avg_disease_rate_df1)] <- 0
avg_disease_rate_df2[is.na(avg_disease_rate_df2)] <- 0
avg_disease_rate_df3[is.na(avg_disease_rate_df3)] <- 0
#transform into dataset
names = names(avg_disease_rate_df1)
rate_1 = as.numeric(avg_disease_rate_df1)
rate_2 = as.numeric(avg_disease_rate_df2)
rate_3 = as.numeric(avg_disease_rate_df3)
# stack data
d1 = data.frame(class = "Less than 5", names = names, rate = rate_1)
d2 = data.frame(class = "5 to 10", names = names, rate = rate_2)
d3 = data.frame(class = "More than 10", names = names, rate = rate_3)
plot_data = rbind(d1, d2, d3)
Making the plot:
library(ggplot2)
ggplot(plot_data, aes(x=names, y=rate, group = class, color=class)) + geom_point() + geom_line() + theme_bw()
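For some reason, the x-axis is not in order: the bins currently appear in a seemingly random sequence, and I would like them arranged from smallest to largest.
I looked at some references that show how to change this manually, but is there an option in ggplot2 that corrects the order automatically?
Thanks!
Edit: you can do this either 1) in the plotting step or 2) in the data manipulation step.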
I think the easiest approach is to convert the x-axis variable to a factor and put its levels in order.
Right now it is a character vector:
str(plot_data)
'data.frame': 30 obs. of 3 variables:
$ class: chr "Less than 5" "Less than 5" "Less than 5" "Less than 5" ...
$ names: chr "[52.6,59.9]" "(59.9,67.2]" "(67.2,74.5]" "(74.5,81.8]" ...
$ rate : num 0.6 0.1 0.339 0.399 0.438 ...
So you can turn it into a factor and then check the levels:
plot_data$names <- as.factor(plot_data$names)
levels(plot_data$names)
This shows them in what looks like a random order (it is actually alphabetical, because the labels are sorted as character strings):
[1] "(104,111]" "(111,118]" "(118,125]" "(59.9,67.2]" "(67.2,74.5]" "(74.5,81.8]" "(81.8,89.1]" "(89.1,96.3]" "(96.3,104]"
[10] "[52.6,59.9]"
You can then reorder the levels with the forcats package (there are other options, but I like this one):
library(forcats)
# Levels not listed here, such as "(118,125]", stay after the listed ones
plot_data$names <- fct_relevel(plot_data$names,
                               c("[52.6,59.9]", "(59.9,67.2]", "(67.2,74.5]",
                                 "(74.5,81.8]", "(81.8,89.1]", "(89.1,96.3]",
                                 "(96.3,104]", "(104,111]", "(111,118]"))
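If typing the levels by hand feels fragile, one possible shortcut is to take the level order from the order in which the labels first appear. This is only a sketch, and it assumes the rows are still in the order `cut()` produced, which should hold for the question's `plot_data` because `names(tapply(...))` follows the factor levels. The vector `labels_in_order` below is a shortened, illustrative stand-in for the real labels.

```r
# Illustrative labels, in the order cut() produced them (shortened for the example)
labels_in_order <- c("[52.6,59.9]", "(59.9,67.2]", "(96.3,104]", "(104,111]")

# Stacked three times, like rbind(d1, d2, d3) in the question
plot_names <- rep(labels_in_order, times = 3)

# Default: levels are the sorted unique strings, so the bins get shuffled
f_sorted <- factor(plot_names)

# Levels taken in order of first appearance, so the cut() order is kept
f_kept <- factor(plot_names, levels = unique(plot_names))
levels(f_kept)
```

Applied to the real data, this would be `plot_data$names <- factor(plot_data$names, levels = unique(plot_data$names))`. `forcats::fct_inorder(plot_data$names)` does the same thing.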
Your plot will then look like this:
ggplot(plot_data, aes(x=names, y=rate, group = class, color=class)) +
geom_point() +
geom_line() +
theme_bw()
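For the plotting-step route mentioned in the question's edit, a possible alternative is to leave the column as character and pass the desired order to `scale_x_discrete(limits = ...)`. The sketch below uses a small made-up data frame called `toy` (hypothetical values, not the question's results) and again assumes the rows are already in bin order.

```r
library(ggplot2)

# Hypothetical miniature of plot_data; the rates are made up for illustration
toy <- data.frame(
  class = rep(c("Less than 5", "5 to 10"), each = 3),
  names = rep(c("[52.6,59.9]", "(59.9,67.2]", "(104,111]"), times = 2),
  rate  = c(0.6, 0.1, 0.4, 0.3, 0.5, 0.2)
)

# Desired axis order: first appearance in the data (assumed to be bin order)
x_order <- unique(toy$names)

p <- ggplot(toy, aes(x = names, y = rate, group = class, color = class)) +
  geom_point() +
  geom_line() +
  scale_x_discrete(limits = x_order) +  # fixes the axis order without touching the data
  theme_bw()
```

This leaves the data untouched, but the order has to be supplied again for every plot, whereas a factor carries its order with it.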
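When you create weight_ntile with cut(), you can ask for an ordered result (ordered_result = TRUE), which gives you an ordered factor. However, as written, the rest of the data manipulation discards that ordered factor. Instead, you can do it all in a single pipeline using dplyr and tidyr.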
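As a minimal sketch of what such a pipeline could look like (an illustration under assumptions, not necessarily the exact code this answer had in mind): the data are re-simulated as in the question, the bins are built with `ordered_result = TRUE`, and the output columns are named `class`, `names` and `rate` to match the question's `plot_data`. Filling empty bins with 0 mirrors the question's own choice.

```r
library(dplyr)
library(tidyr)

# Re-create the question's simulated data
set.seed(123)
Hospital_Visits <- sample.int(20, 5000, replace = TRUE)
Weight <- rnorm(5000, 90, 10)
Disease <- factor(sample(c("Yes", "No"), 5000, replace = TRUE, prob = c(0.4, 0.6)))
my_data <- data.frame(Weight, Hospital_Visits, Disease)

plot_data <- my_data %>%
  mutate(
    hospital_ntile = cut(Hospital_Visits, breaks = c(0, 5, 10, Inf),
                         labels = c("Less than 5", "5 to 10", "More than 10"),
                         include.lowest = TRUE),
    # Ten equal-width bins; ordered_result = TRUE keeps them as an ordered factor
    weight_ntile = cut(Weight,
                       breaks = seq(min(Weight), max(Weight), length.out = 11),
                       include.lowest = TRUE, ordered_result = TRUE)
  ) %>%
  group_by(class = hospital_ntile, names = weight_ntile) %>%
  summarise(rate = mean(Disease == "Yes"), .groups = "drop") %>%
  # Add any class/bin combination with no rows, with rate 0 as in the question
  complete(class, names, fill = list(rate = 0))
```

Because `names` never passes through a character vector, its levels stay in bin order and no manual releveling is needed.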
The plot can then be drawn in the same way, with the same ggplot() call as above.