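I have two datasets: one is modeled (artificial) data and the other is observed data. Their statistical distributions differ slightly, and I want to force the modeled data to match the observed distribution in terms of the spread of the data. In other words, I need the modeled data to better represent the tails of the observed data. Here is an example.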
model <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)
summary(model)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  16.62   36.98   40.38   40.28   44.91   54.15
summary(observed)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  12.62   35.54   42.58   41.10   47.76   59.25
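How can I force the model data to have the variability seen in the observed data, using R?
Are you just modeling the distribution of `observed`? If so, you can generate a kernel density estimate from the observations and then resample from that estimated density. For example: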
library(ggplot2)
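First, we generate a density estimate from the observed values. This is our model of the distribution of the observed values. `adjust` is a parameter that scales the bandwidth. The default value is 1. Smaller values result in less smoothing (that is, a density estimate that more closely follows small-scale structure in the data):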
dens.obs = density(observed, adjust=0.8)
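To make the effect of `adjust` concrete, here is a minimal sketch that compares the bandwidth with and without the adjustment. It assumes the default bandwidth selector (`bw.nrd0`) is in use, and it runs on a short illustrative vector `x` (the first eleven values of `observed`) so that it is self-contained. The names `bw_default` and `bw_narrow` are made up for this example.

```r
# Illustrative subset: the first eleven values of `observed`
x <- c(39.50, 44.79, 58.28, 56.04, 53.40, 59.25, 48.49, 54.51, 35.38, 39.98, 28.00)

bw_default <- density(x)$bw                # bandwidth chosen by the default selector
bw_narrow  <- density(x, adjust = 0.8)$bw  # same selector, scaled by adjust

# `adjust` multiplies the selected bandwidth, so these two should agree
all.equal(bw_narrow, 0.8 * bw_default)

# The default selector is bw.nrd0, so these two should agree as well
all.equal(bw_default, bw.nrd0(x))
```

A narrower bandwidth keeps the estimate closer to the data, which matters here because heavy smoothing widens the tails beyond what was actually observed.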
Now, resample from the density estimate to get the modeled values. We set `prob=dens.obs$y` so that the probability of picking a given value in `dens.obs$x` is proportional to its estimated density.
set.seed(439)
resample.obs = sample(dens.obs$x, 1000, replace=TRUE, prob=dens.obs$y)
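A related way to draw from a kernel density estimate, without going through the grid of points that `density()` evaluates, is a smoothed bootstrap: resample the data with replacement and add kernel-shaped noise. The sketch below assumes the default Gaussian kernel, for which the `bw` value returned by `density()` is the kernel's standard deviation. The vector `x` (the first eleven values of `observed`) and the names `n_draws` and `smooth_resample` are illustrative and not part of the original answer.

```r
# Illustrative subset: the first eleven values of `observed`
x <- c(39.50, 44.79, 58.28, 56.04, 53.40, 59.25, 48.49, 54.51, 35.38, 39.98, 28.00)

set.seed(439)
n_draws <- 1000

# Bandwidth of the Gaussian kernel, scaled the same way as in the answer
bw <- density(x, adjust = 0.8)$bw

# Pick data points at random, then jitter each one with Gaussian noise of sd = bw
smooth_resample <- sample(x, n_draws, replace = TRUE) +
  rnorm(n_draws, mean = 0, sd = bw)

summary(smooth_resample)
```

The draws are continuous rather than restricted to grid points, which may be preferable if many samples are needed. Either approach should give a similar distribution for the same bandwidth.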
Put the observed and modeled values into a data frame, ready for plotting:
dat = data.frame(value=c(observed,resample.obs),
group=rep(c("Observed","Modeled"), c(length(observed),length(resample.obs))))
The ECDF (empirical cumulative distribution function) plot below shows that the sample drawn from the kernel density estimate has a distribution similar to that of the observed data:
ggplot(dat, aes(value, fill=group, colour=group)) +
stat_ecdf(geom="step") +
theme_bw()
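As a numerical complement to the ECDF plot, a few quantiles of the two samples can be compared side by side. This is a minimal self-contained sketch that repeats the resampling step from above. The choice of probabilities in `probs` and the name `quantile_check` are assumptions made for illustration.

```r
observed <- c(39.50,44.79,58.28,56.04,53.40,59.25,48.49,54.51,35.38,39.98,28.00,
              28.49,27.74,51.92,42.53,44.91,44.91,40.00,41.51,47.92,36.98,53.40,
              42.26,42.89,43.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
              52.81,36.87,47.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
              51.34,43.37,51.15,42.77,42.88,44.26,27.14,39.31,24.80,12.62,30.30,
              34.39,25.60,38.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
              34.65,39.54,47.70,38.11,43.05,29.95,22.48,24.63,35.33,41.34)

# Same density estimate and resampling step as in the answer
dens.obs <- density(observed, adjust = 0.8)
set.seed(439)
resample.obs <- sample(dens.obs$x, 1000, replace = TRUE, prob = dens.obs$y)

# Compare the tails (5% and 95%) and the central quantiles
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quantile_check <- rbind(
  observed  = quantile(observed, probs),
  resampled = quantile(resample.obs, probs)
)
round(quantile_check, 2)
```

Because kernel smoothing adds spread on the order of the bandwidth, the resampled tails can come out slightly wider than the observed ones. Lowering `adjust` reduces that effect.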
You can also plot the density of the observed data together with the density of the values sampled from the modeled distribution (using the same value of the `adjust` parameter as above).
ggplot(dat, aes(value, fill=group, colour=group)) +
geom_density(alpha=0.4, adjust=0.8) +
theme_bw()