使用线性判别分析或逻辑回归评估/改进预测

Cio*_*chi 4 analysis r lda logistic-regression

我是R的新手,虽然我在Matlab和一些基础数据分析方面非常熟练,尽管我只进行了基本的统计分析(从未使用超过一些Mann-Whitney/T-test/correlation等) .

我最近需要在一些数据集上组合两个或多个变量来评估它们的组合是否可以增强预测性,因此我在R中做了一些逻辑回归.现在,在统计问答上,有人建议我可以使用线性判别分析.由于我在Matlab中没有任何fitcdiscr.m,我宁愿在R中使用lda,但我不能使用拟合结果来预测AUC或我可以使用的任何东西.实际上,我看到R中的lda的适合输出向量是某种具有多个类的向量,我想我应该使用fit $ posterior来预测对照控件的情况,但我无法从中获取这些数据.有帮助吗？

有关更多信息,我得到这个结果适合$ posterior:

$posterior
            0          1
1   0.7707927 0.22920726
2   0.7085165 0.29148352
3   0.6990989 0.30090106
4   0.5902161 0.40978387
5   0.8667109 0.13328912
6   0.6924406 0.30755939
7   0.7471086 0.25289141
8   0.7519326 0.24806736

Run Code Online (Sandbox Code Playgroud)

等到242的最后一次观察.每次我尝试采用例如第1列的拟合后验[,1],我得到:

        1         2         3         4         5         6         7         8 
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326 
        9        10        11        12        13        14        15        16 
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121

Run Code Online (Sandbox Code Playgroud)

编辑:我不知道代码的哪一部分可能有用,因为我做了非常基本的计算:

library(gdata)
data=read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe");
i=6;
p=19;
temp=data[,i];
temp1=data[, p];
library(MASS)
fit <- lda(Case ~ temp + temp , data=data, na.action="na.omit", CV=TRUE)

Run Code Online (Sandbox Code Playgroud)

我无法链接数据,无论如何心电图变量只是一个N观察x P变量,N = N1 + N2,N1控制数量和N2个案数量,病例定义为随后发生病理学的受试者起来.对于控件和案例,最后一列数据分别仅为0或1.

当我执行逻辑回归时,我做了:

mod1<-glm(Case ~ temp + temp1, data=data,     family="binomial"); 
auctemp=auc(Case~predict(mod1), data=data);

Run Code Online (Sandbox Code Playgroud)

这是我关于逻辑回归和预测的输入(我对线性判别知之甚少,但理解它与逻辑回归密切相关,我知道的更好).我不确定我是否遵循你的所有推理,也不确定这是否是一个令人满意的答案,但希望它不会受到伤害.这是对我的一些流行病学课程的回顾.我希望它不是太正式,至少部分解决了你的一些问题.如果没有,如果其他用户认为这最好属于Cross Validated,我不会冒犯.:)

样本数据

我们将首先生成200个观察值,其中Case = 1的概率水平越来越高.第一个预测变量(pred1)将遵循非线性分布,接近于进行逻辑回归时建模的分布.它将与案件的比例密切相关.第二个预测器只是随机的,均匀分布的噪声.

set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE), 
                          sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE), 
                          sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
                 pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
                 pred2 = runif(n = 200, min = 0, max = 100))

Run Code Online (Sandbox Code Playgroud)

我们在下面的方框图中看到case==1通常具有更高的观察结果pred1,这是预期的(从我们生成数据的方式).同时,存在重叠,否则将使得决定截止点/阈值变得太容易.

boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")

Run Code Online (Sandbox Code Playgroud)

箱形图

拟合逻辑模型

首先使用两个预测变量:

model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)

# Coefficients:
#              Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -2.058258   0.479094  -4.296 1.74e-05 ***
# pred1        0.428491   0.075373   5.685 1.31e-08 ***
# pred2        0.003399   0.005500   0.618    0.537    
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# 
# (Dispersion parameter for binomial family taken to be 1)
# 
#     Null deviance: 276.76  on 199  degrees of freedom
# Residual deviance: 238.51  on 197  degrees of freedom
# AIC: 244.51

Run Code Online (Sandbox Code Playgroud)

正如我们所预料的那样,第一个预测因子具有相当强的相关性,而第二个预测因子与结果关系不大.

请注意,要从这些系数中获取赔率比率,我们需要对它们进行取幂:

exp(model.1$coefficients[2:3])

#    pred1    pred2 
# 1.534939 1.003405   # Odds Ratios (making the relationships appear more clearly). 
                      # Use `exp(confint(model.1))` to get confidence intervals.

Run Code Online (Sandbox Code Playgroud)

我们将此模型与更简单的模型进行比较,删除第二个预测变量:

model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)

# Coefficients:
#             Estimate Std. Error z value Pr(>|z|)    
# (Intercept) -1.87794    0.37452  -5.014 5.32e-07 ***
# pred1        0.42651    0.07514   5.676 1.38e-08 ***
# ---
# Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
#     Null deviance: 276.76  on 199  degrees of freedom
# Residual deviance: 238.89  on 198  degrees of freedom
# AIC: 242.89

exp(model.2$coefficients)[2]

#    pred1 
# 1.531907  # Odds Ratio

Run Code Online (Sandbox Code Playgroud)

我们也可以运行一个anova(model.1, model.2),但让我们跳过这一部分继续进行预测,保持这个更简单的模型,因为第二个变量不会增加太多预测值,如果有的话.实际上,拥有更多的预测因素很少是一个问题,除非它是真正的随机噪声,但在这里我更多地关注预测和选择适当阈值的操作.

存储的预测

在model.2对象(列表)中,有一个名为的项目fitted.values.这些值与我们得到的完全相同predict(model.2, type="response"),可以解释为概率; 每行一个,基于预测变量及其系数.

新的预测

也可以预测不在我们的初始数据帧中的假设行的结果.

有model.1(2个预测因子):

predict(model.1, newdata = list(pred1=1, pred2=42), type="response")

#         1 
# 0.1843701

Run Code Online (Sandbox Code Playgroud)

使用model.2(1个预测器):

predict(model.2, newdata = list(pred1=12), type="response")

#       1 
# 0.96232

Run Code Online (Sandbox Code Playgroud)

从概率到二元响应

回顾我们的预测器pred1与计算出的概率之间的联系Case=1:

plot(df$pred1, model.2$fitted.values, 
     xlab="pred1", ylab="probability that Case=1")

Run Code Online (Sandbox Code Playgroud)

P(案例)反对pred1

我们注意到,由于我们只有一个预测器,因此概率是它的直接函数.如果我们将另一个预测因子保留在等式中,我们会看到在同一条线上分组的点,但是在点云中.

但这并没有改变这样一个事实,即如果我们要评估我们的模型能够多好地预测二元结果,我们需要确定一个阈值,高于该阈值我们会认为观察是一个案例.有几个软件包可以帮助我们找到这个门槛.但即使没有任何额外的包,我们也可以使用如下函数计算一系列阈值的各种属性,这将计算灵敏度(检测真实案例的能力),特异性(识别真实非案例的能力),以及其他这里描述的属性很好.

df.ana <- data.frame(thresh=seq(from = 0, to = 100, by = 0.5) / 100)
for(i in seq_along(df.ana$thresh)) {
    df.ana$sensitivity[i] <- sum(df$Case==1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(df$Case==1)
    df.ana$specificity[i] <- sum(df$Case==0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(df$Case==0)
    df.ana$pos.pred.value[i] <- sum(df$Case == 1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(predict(model.2, type="resp") >= df.ana$thresh[i])
    df.ana$neg.pred.value[i] <- sum(df$Case == 0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(predict(model.2, type="resp") < df.ana$thresh[i])
    df.ana$accuracy[i] <- sum((predict(model.2, type="resp") >= df.ana$thresh[i]) == df$Case) / nrow(df)
}

which.max(df.ana$accuracy)

# [1] 46 

optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46

Run Code Online (Sandbox Code Playgroud)

准确度是对所有预测的正确预测的比例.第46个门槛(0.46)是这个问题的"最佳".让我们检查生成的数据帧中的一些其他相邻行; 它告诉我们0.47在所有方面都可以正常工作.微调将涉及在我们的初始数据帧中添加一些新数据.

df.ana[45:48,]

#    thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45   0.45   0.7142857   0.6947368      0.7211538      0.6875000    0.705
# 46   0.46   0.7142857   0.7157895      0.7352941      0.6938776    0.715
# 47   0.47   0.7142857   0.7157895      0.7352941      0.6938776    0.715
# 48   0.48   0.7047619   0.7157895      0.7326733      0.6868687    0.710

Run Code Online (Sandbox Code Playgroud)

请注意,auc函数(曲线下面的区域)将给出与该阈值的精度相同的数字:

library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)

# Area under the curve: 0.715

Run Code Online (Sandbox Code Playgroud)

一些情节

# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
         xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")

# thresholds against Sensitivity 
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l",col="blue") # Sensitivity We want to maximize this, but not too much
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")

# thresholds against specificity 
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity we want to maximize also, but not too much
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")

# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)

Run Code Online (Sandbox Code Playgroud)

准确性敏感性特异性

顺便说一句,所有的线条或多或少地汇集到同一点,这表明这是我们在预测工具中寻找的所有品质之间的良好折衷.但根据您的目标,选择较低或较高的阈值可能会更好.统计工具很有用,但最终,在做出最终决定时,其他一些考虑因素往往更为重要.

关于洛克石油

下图与使用pROC的roc生成的图相同:

plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
         xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity") 

# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))

Run Code Online (Sandbox Code Playgroud)

制表和其他统计数据

对于逻辑模型拟合,以下函数允许计算上面看到的相同统计数据,并为任何选定的阈值提供2x2表.

diagnos.test <- function(model, threshold) {
    output <- list()
    output$stats <- c(
      sensitivity = sum(model.1$y==1 & (predict(model, type="resp") >= threshold)) / sum(model.1$y==1),
      specificity = sum(model.1$y==0 & (predict(model, type="resp") < threshold)) / sum(model.1$y==0),
      pos.pr.value = sum(model.1$y==1 & (predict(model.2, type="resp") >= threshold)) / sum(predict(model.2, type="resp") >= threshold),
      neg.pr.value = sum(df$Case == 0 & (predict(model.2, type="resp") < threshold)) / sum(predict(model.2, type="resp") < threshold),
      accuracy = sum((predict(model.2, type="resp") >= threshold) == df$Case) / nrow(df))
    output$tab <- addmargins(t(table(model$y, as.numeric(predict(model, type="response") > threshold),dnn = list("Cases", "Predictions")))[2:1,2:1])
    return(output)
}

diagnos.test(model.2, 0.47)

# $stats
#  sensitivity  specificity pos.pr.value neg.pr.value     accuracy 
#    0.7142857    0.7157895    0.7352941    0.6938776    0.7150000 
# 
# $tab
#            Cases
# Predictions   1  0 Sum
#         1    75 27 102
#         0    30 68  98
#         Sum 105 95 200

Run Code Online (Sandbox Code Playgroud)

最后的说明

我并不假装我已经涵盖了关于预测,敏感性和特异性的所有内容; 我的目标是尽可能使用通用语言和计算,而不是依赖于任何特定的包.

归档时间：	10 年，5 月前
查看次数：	1701 次
最近记录：	6 年，1 月前

如何计算逻辑向量中的TRUE值 152

我可以从顶部到底部更改ggplot中条带标签的位置吗？ 60

将列表强制转换为数据框的最有效方法是什么？ 46

状态名称缩写为R 42

相同(x,y)和isTRUE(all.equal(x,y))之间的R有什么不同？ 31

导出函数中的R data.table中断 20

R - 仅在具有混合类型的数据框中缩放数值 15

在Linux中替代R的`memory.size()`？ 12

从R中的(x,y)corrdinates创建热图 11

使用broom :: augment函数以响应比例获取预测值 11

如何在JavaScript中替换所有出现的字符串 4081

在Python中获取列表的最后一个元素 1871

如何为C#Auto-Property提供默认值？ 1773

喜欢构成而不是继承？ 1538

每位程序员应阅读的最具影响力的单一书籍是什么？ 1439

在Python中使用大写字母和数字生成随机字符串 1247

sh和bash之间的区别 1194

按ID删除元素 1085

在Python中查找扩展名为.txt的目录中的所有文件 1043

如何将div中的绝对定位元素居中？ 1034