请考虑以下数据:
contesto x y perc
1 M01 81.370 255.659 22
2 M02 85.814 242.688 16
3 M03 73.204 240.526 33
4 M04 66.478 227.916 46
5 M04a 67.679 218.668 15
6 M05 59.632 239.325 35
7 M06 64.316 252.777 23
8 M08 90.258 227.676 45
9 M09 100.707 217.828 58
10 M10 89.829 205.278 53
11 M11 114.998 216.747 15
12 M12 119.922 235.482 18
13 M13 129.170 239.205 36
14 M14 142.501 229.717 24
15 M15 76.206 213.144 24
16 M16 30.090 166.785 33
17 M17 130.731 219.989 56
18 M18 74.885 192.336 36
19 M19 48.823 142.645 32
20 M20 48.463 186.361 24
21 M21 74.765 205.698 16
Run Code Online (Sandbox Code Playgroud)
我想为perc加权的点x和y创建一个二维密度图.我能做到这一点(虽然我不认为正确)通过如下rep:
library(ggplot2)
dataset2 <- with(dataset, dataset[rep(1:nrow(dataset), perc),])
ggplot(dataset2, aes(x, y)) +
stat_density2d(aes(alpha=..level.., fill=..level..), size=2,
bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black", bins=10) +
geom_point(data = dataset) +
guides(alpha=FALSE) + xlim(c(10, 160)) + ylim(c(120, 280))
Run Code Online (Sandbox Code Playgroud)

这似乎不是正确的方法,因为其他geoms允许加权,如:
dat <- as.data.frame(ftable(mtcars$cyl))
ggplot(dat, aes(x=Var1)) + geom_bar(aes(weight=Freq))
Run Code Online (Sandbox Code Playgroud)
但是,如果我在这里尝试使用权重,则该图与数据不匹配(desc被忽略):
ggplot(dataset, aes(x, y)) +
stat_density2d(aes(alpha=..level.., fill=..level.., weight=perc),
size=2, bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black", bins=10, aes(weight=perc)) +
geom_point(data = dataset) +
guides(alpha=FALSE) + xlim(c(10, 160)) + ylim(c(120, 280))
Run Code Online (Sandbox Code Playgroud)

这是使用rep正确的方法来加权密度还是有更好的方法类似于weight论证geom_bar?
这个rep方法看起来像用基本R做的内核密度所以我假设这是它应该看起来的样子:

dataset <- structure(list(contesto = structure(1:21, .Label = c("M01", "M02",
"M03", "M04", "M04a", "M05", "M06", "M08", "M09", "M10", "M11",
"M12", "M13", "M14", "M15", "M16", "M17", "M18", "M19", "M20",
"M21"), class = "factor"), x = c(81.37, 85.814, 73.204, 66.478,
67.679, 59.632, 64.316, 90.258, 100.707, 89.829, 114.998, 119.922,
129.17, 142.501, 76.206, 30.09, 130.731, 74.885, 48.823, 48.463,
74.765), y = c(255.659, 242.688, 240.526, 227.916, 218.668, 239.325,
252.777, 227.676, 217.828, 205.278, 216.747, 235.482, 239.205,
229.717, 213.144, 166.785, 219.989, 192.336, 142.645, 186.361,
205.698), perc = c(22, 16, 33, 46, 15, 35, 23, 45, 58, 53, 15,
18, 36, 24, 24, 33, 56, 36, 32, 24, 16)), .Names = c("contesto",
"x", "y", "perc"), row.names = c(NA, -21L), class = "data.frame")
Run Code Online (Sandbox Code Playgroud)
如果你的权重是每个坐标(或按比例)的#个观察值,我认为你做得对。该函数似乎期望所有观察结果,并且如果您在原始数据集上调用它,则无法动态更新 ggplot 对象,因为它已经对密度进行了建模,并且包含派生的绘图数据。
如果您的实际数据集很大,您可能想使用data.table而不是,它的速度大约快 70 倍。with()例如,请参阅此处了解 1m 坐标,具有 1-20 次重复(在本例中>10m 观察值)。不过,与 660 个观测值没有性能相关性(无论如何,该图可能会成为大型数据集的性能瓶颈)。
bigtable<-data.frame(x=runif(10e5),y=runif(10e5),perc=sample(1:20,10e5,T))
system.time(rep.with.by<-with(bigtable, bigtable[rep(1:nrow(bigtable), perc),]))
#user system elapsed
#11.67 0.18 11.92
system.time(rep.with.dt<-data.table(bigtable)[,list(x=rep(x,perc),y=rep(y,perc))])
#user system elapsed
#0.12 0.05 0.18
# CHECK THEY'RE THE SAME
sum(rep.with.dt$x)==sum(rep.with.by$x)
#[1] TRUE
# OUTPUT ROWS
nrow(rep.with.dt)
#[1] 10497966
Run Code Online (Sandbox Code Playgroud)