I am training a model in R using the caret package:
ctrl <- trainControl(method = "repeatedcv", repeats = 3, summaryFunction = twoClassSummary)
logitBoostFit <- train(LoanStatus~., credit, method = "LogitBoost", family=binomial, preProcess=c("center", "scale", "pca"),
trControl = ctrl)
I get the following warnings:
Warning message:
In train.default(x, y, weights = w, ...): The metric "Accuracy" was not in the result set. ROC will be used instead.
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.03496 Min. :0.9747
1st Qu.: NA 1st Qu.:0.03919 1st Qu.:0.9758
Median : NA Median :0.04343 Median :0.9770
Mean :NaN Mean :0.04349 Mean :0.9779
3rd Qu.: NA 3rd Qu.:0.04776 3rd Qu.:0.9795
Max. : NA Max. :0.05210 Max. :0.9821
NA's :3
Error in train.default(x, y, weights = w, ...): Stopping
I have the pROC package installed:
install.packages("pROC", repos="http://cran.rstudio.com/")
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: ‘pROC’
The following objects are masked from ‘package:stats’:
cov, smooth, var
Here is the data:
str(credit)
'data.frame': 8580 obs. of 45 variables:
$ ListingCategory : int 1 7 3 1 1 7 1 1 1 1 ...
$ IncomeRange : int 3 4 6 4 4 3 3 4 3 3 ...
$ StatedMonthlyIncome : num 2583 4326 10500 4167 5667 ...
$ IncomeVerifiable : logi TRUE TRUE TRUE FALSE TRUE TRUE ...
$ DTIwProsperLoan : num 1.8e-01 2.0e-01 1.7e-01 1.0e+06 1.8e-01 4.4e-01 2.2e-01 2.0e-01 2.0e-01 3.1e-01 ...
$ EmploymentStatusDescription: Factor w/ 7 levels "Employed","Full-time",..: 1 4 1 7 1 1 1 1 1 1 ...
$ Occupation : Factor w/ 65 levels "","Accountant/CPA",..: 37 37 20 14 43 58 48 37 37 37 ...
$ MonthsEmployed : int 4 44 159 67 26 16 209 147 24 9 ...
$ BorrowerState : Factor w/ 48 levels "AK","AL","AR",..: 22 32 5 5 14 28 4 10 10 34 ...
$ BorrowerCity : Factor w/ 3089 levels "AARONSBURG","ABERDEEN",..: 1737 3059 2488 654 482 719 895 1699 2747 1903 ...
$ BorrowerMetropolitanArea : Factor w/ 1 level "(Not Implemented)": 1 1 1 1 1 1 1 1 1 1 ...
$ LenderIndicator : int 0 0 0 1 0 0 0 0 1 0 ...
$ GroupIndicator : logi FALSE FALSE FALSE TRUE FALSE FALSE ...
$ GroupName : Factor w/ 83 levels "","00 Used Car Loans",..: 1 1 1 47 1 1 1 1 1 1 ...
$ ChannelCode : int 90000 90000 90000 80000 40000 40000 90000 90000 80000 90000 ...
$ AmountParticipation : int 0 0 0 0 0 0 0 0 0 0 ...
$ MonthlyDebt : int 247 785 1631 817 644 1524 427 817 654 749 ...
$ CurrentDelinquencies : int 0 0 0 0 0 0 0 1 0 1 ...
$ DelinquenciesLast7Years : int 0 10 0 0 0 0 0 0 0 0 ...
$ PublicRecordsLast10Years : int 0 1 0 0 0 0 1 0 1 0 ...
$ PublicRecordsLast12Months : int 0 0 0 0 0 0 0 0 0 0 ...
$ FirstRecordedCreditLine : Factor w/ 4719 levels "1/1/00 0:00",..: 3032 2673 1197 2541 4698 4345 3150 925 4452 2358 ...
$ CreditLinesLast7Years : int 53 30 36 26 7 22 15 20 34 32 ...
$ InquiriesLast6Months : int 2 8 5 0 0 0 0 3 0 0 ...
$ AmountDelinquent : int 0 0 0 0 0 0 0 63 0 15 ...
$ CurrentCreditLines : int 10 10 18 10 4 11 6 10 7 8 ...
$ OpenCreditLines : int 9 10 15 8 3 8 5 7 7 8 ...
$ BankcardUtilization : num 0.26 0.69 0.94 0.69 0.81 0.38 0.55 0.24 0.03 0 ...
$ TotalOpenRevolvingAccounts : int 9 7 12 10 3 5 4 5 4 6 ...
$ InstallmentBalance : int 48648 14827 0 0 0 30916 0 21619 41340 15447 ...
$ RealEstateBalance : int 0 0 577745 0 0 0 191296 0 0 126039 ...
$ RevolvingBalance : int 5265 9967 94966 50511 37871 22463 19550 2436 1223 3236 ...
$ RealEstatePayment : int 0 0 4159 0 0 0 1303 0 0 1279 ...
$ RevolvingAvailablePercent : int 78 52 36 45 18 61 44 74 96 76 ...
$ TotalInquiries : int 8 11 15 2 0 0 1 7 1 1 ...
$ TotalTradeItems : int 53 30 36 26 7 22 15 20 34 32 ...
$ SatisfactoryAccounts : int 52 23 36 26 7 19 15 18 34 29 ...
$ NowDelinquentDerog : int 0 0 0 0 0 0 0 1 0 1 ...
$ WasDelinquentDerog : int 1 7 0 0 0 3 0 1 0 2 ...
$ OldestTradeOpenDate : int 5092001 5011977 12011984 4272000 9081993 9122000 6161987 11181999 9191990 4132000 ...
$ DelinquenciesOver30Days : int 0 6 0 0 0 13 0 2 0 2 ...
$ DelinquenciesOver60Days : int 0 4 0 0 0 0 0 0 0 1 ...
$ DelinquenciesOver90Days : int 0 10 0 0 0 0 0 0 0 0 ...
$ IsHomeowner : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
$ LoanStatus : Factor w/ 2 levels "0","1": 2 1 1 2 2 2 2 2 2 1 ...
summary(credit)
 ListingCategory   IncomeRange    StatedMonthlyIncome IncomeVerifiable
 Min.   : 0.000   Min.   :1.000   Min.   :     0      Mode :logical
 1st Qu.: 1.000   1st Qu.:3.000   1st Qu.:  3167      FALSE:784
 Median : 2.000   Median :4.000   Median :  4750      TRUE :7796
 Mean   : 4.997   Mean   :4.089   Mean   :  5755      NA's :0
 3rd Qu.: 7.000   3rd Qu.:5.000   3rd Qu.:  7083
 Max.   :20.000   Max.   :7.000   Max.   :250000
 DTIwProsperLoan     EmploymentStatusDescription MonthsEmployed
 Min.   :      0.0   Employed     :7182          Min.   :-23.00
 1st Qu.:      0.1   Full-time    : 416          1st Qu.: 26.00
 Median :      0.2   Not employed : 122          Median : 68.00
 Mean   :  91609.4   Other        : 475          Mean   : 97.44
 3rd Qu.:      0.3   Part-time    :   7          3rd Qu.:139.00
 Max.   :1000000.0   Retired      :  32          Max.   :755.00
                     Self-employed: 346          NA's   :5
 BorrowerState  LenderIndicator   GroupIndicator  ChannelCode
 CA     :1056   Min.   :0.00000   Mode :logical   Min.   :40000
 FL     : 608   1st Qu.:0.00000   FALSE:8325      1st Qu.:80000
 NY     : 574   Median :0.00000   TRUE :255       Median :80000
 TX     : 532   Mean   :0.09196   NA's :0         Mean   :77196
 IL     : 443   3rd Qu.:0.00000                   3rd Qu.:90000
 GA     : 343   Max.   :1.00000                   Max.   :90000
 (Other):5024
 MonthlyDebt       CurrentDelinquencies DelinquenciesLast7Years
 Min.   :    0.0   Min.   : 0.0000      Min.   : 0.000
 1st Qu.:  364.0   1st Qu.: 0.0000      1st Qu.: 0.000
 Median :  708.0   Median : 0.0000      Median : 0.000
 Mean   :  885.5   Mean   : 0.4119      Mean   : 4.009
 3rd Qu.: 1205.2   3rd Qu.: 0.0000      3rd Qu.: 3.000
 Max.   :30213.0   Max.   :21.0000      Max.   :99.000
 PublicRecordsLast10Years PublicRecordsLast12Months CreditLinesLast7Years
 Min.   : 0.0000          Min.   :0.00000           Min.   :  2.0
 1st Qu.: 0.0000          1st Qu.:0.00000           1st Qu.: 16.0
 Median : 0.0000          Median :0.00000           Median : 24.0
 Mean   : 0.2809          Mean   :0.01364           Mean   : 26.1
 3rd Qu.: 0.0000          3rd Qu.:0.00000           3rd Qu.: 34.0
 Max.   :11.0000          Max.   :4.00000           Max.   :115.0
 InquiriesLast6Months AmountDelinquent CurrentCreditLines OpenCreditLines
 Min.   : 0.0000      Min.   :     0   Min.   : 0.000     Min.   : 0.000
 1st Qu.: 0.0000      1st Qu.:     0   1st Qu.: 5.000     1st Qu.: 5.000
 Median : 1.0000      Median :     0   Median : 9.000     Median : 8.000
 Mean   : 0.9994      Mean   :  1195   Mean   : 9.345     Mean   : 8.306
 3rd Qu.: 1.0000      3rd Qu.:     0   3rd Qu.:12.000     3rd Qu.:11.000
 Max.   :15.0000      Max.   :179158   Max.   :54.000     Max.   :42.000
 BankcardUtilization TotalOpenRevolvingAccounts InstallmentBalance
 Min.   :0.0000      Min.   : 0.000             Min.   :     0
 1st Qu.:0.2500      1st Qu.: 3.000             1st Qu.:  3338
 Median :0.5400      Median : 6.000             Median : 14453
 Mean   :0.5182      Mean   : 6.441             Mean   : 24900
 3rd Qu.:0.7900      3rd Qu.: 9.000             3rd Qu.: 32238
 Max.   :2.2300      Max.   :44.000             Max.   :739371
                                                NA's   :328
 RealEstateBalance RevolvingBalance RealEstatePayment RevolvingAvailablePercent
 Min.   :      0   Min.   :     0   Min.   :    0.0   Min.   :  0.00
 1st Qu.:      0   1st Qu.:  2799   1st Qu.:    0.0   1st Qu.: 29.00
 Median :  26154   Median :  8784   Median :  346.5   Median : 52.00
 Mean   : 109306   Mean   : 19555   Mean   :  830.5   Mean   : 51.46
 3rd Qu.: 176542   3rd Qu.: 21110   3rd Qu.: 1382.2   3rd Qu.: 75.00
 Max.   :1938421   Max.   :695648   Max.   :13651.0   Max.   :100.00
 TotalInquiries  TotalTradeItems SatisfactoryAccounts NowDelinquentDerog
 Min.   : 0.00   Min.   :  2.0   Min.   :  1.00       Min.   : 0.0000
 1st Qu.: 2.00   1st Qu.: 16.0   1st Qu.: 14.00       1st Qu.: 0.0000
 Median : 3.00   Median : 24.0   Median : 21.00       Median : 0.0000
 Mean   : 3.91   Mean   : 26.1   Mean   : 23.34       Mean   : 0.4119
 3rd Qu.: 5.00   3rd Qu.: 34.0   3rd Qu.: 30.25       3rd Qu.: 0.0000
 Max.   :36.00   Max.   :115.0   Max.   :113.00       Max.   :21.0000
 WasDelinquentDerog OldestTradeOpenDate DelinquenciesOver30Days
 Min.   : 0.000     Min.   : 1011957    Min.   : 0.000
 1st Qu.: 0.000     1st Qu.: 4101996    1st Qu.: 0.000
 Median : 1.000     Median : 7191993    Median : 1.000
 Mean   : 2.343     Mean   : 6934230    Mean   : 4.332
 3rd Qu.: 3.000     3rd Qu.:10011990    3rd Qu.: 5.000
 Max.   :32.000     Max.   :12312004    Max.   :99.000
 DelinquenciesOver60Days DelinquenciesOver90Days IsHomeowner     LoanStatus
 Min.   : 0.000          Min.   : 0.000          Mode :logical   0:1518
 1st Qu.: 0.000          1st Qu.: 0.000          FALSE:4264      1:7062
 Median : 0.000          Median : 0.000          TRUE :4316
 Mean   : 1.908          Mean   : 4.009          NA's :0
 3rd Qu.: 2.000          3rd Qu.: 3.000
 Max.   :73.000          Max.   :99.000
I did not find any missing values:
try(na.fail(credit))
dput(head(credit,4))
structure(list(ListingCategory = c(1L, 7L, 3L, 1L), IncomeRange = c(3L,
4L, 6L, 4L), StatedMonthlyIncome = c(2583.3333, 4326, 10500,
4166.6667), IncomeVerifiable = c(TRUE, TRUE, TRUE, FALSE), DTIwProsperLoan = c(0.18,
0.2, 0.17, 1e+06), EmploymentStatusDescription = structure(c(1L,
4L, 1L, 7L), .Label = c("Employed", "Full-time", "Not employed",
"Other", "Part-time", "Retired", "Self-employed"), class = "factor"),
MonthsEmployed = c(4L, 44L, 159L, 67L), BorrowerState = structure(c(22L,
32L, 5L, 5L), .Label = c("AK", "AL", "AR", "AZ", "CA", "CO",
"CT", "DC", "DE", "FL", "GA", "HI", "ID", "IL", "IN", "KS",
"KY", "LA", "MA", "MD", "MI", "MN", "MO", "MS", "MT", "NC",
"NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA",
"RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI",
"WV", "WY"), class = "factor"), LenderIndicator = c(0L, 0L,
0L, 1L), GroupIndicator = c(FALSE, FALSE, FALSE, TRUE), ChannelCode = c(90000L,
90000L, 90000L, 80000L), MonthlyDebt = c(247L, 785L, 1631L,
817L), CurrentDelinquencies = c(0L, 0L, 0L, 0L), DelinquenciesLast7Years = c(0L,
10L, 0L, 0L), PublicRecordsLast10Years = c(0L, 1L, 0L, 0L
), PublicRecordsLast12Months = c(0L, 0L, 0L, 0L), CreditLinesLast7Years = c(53L,
30L, 36L, 26L), InquiriesLast6Months = c(2L, 8L, 5L, 0L),
AmountDelinquent = c(0L, 0L, 0L, 0L), CurrentCreditLines = c(10L,
10L, 18L, 10L), OpenCreditLines = c(9L, 10L, 15L, 8L), BankcardUtilization = c(0.26,
0.69, 0.94, 0.69), TotalOpenRevolvingAccounts = c(9L, 7L,
12L, 10L), InstallmentBalance = c(48648L, 14827L, 0L, 0L),
RealEstateBalance = c(0L, 0L, 577745L, 0L), RevolvingBalance = c(5265L,
9967L, 94966L, 50511L), RealEstatePayment = c(0L, 0L, 4159L,
0L), RevolvingAvailablePercent = c(78L, 52L, 36L, 45L), TotalInquiries = c(8L,
11L, 15L, 2L), TotalTradeItems = c(53L, 30L, 36L, 26L), SatisfactoryAccounts = c(52L,
23L, 36L, 26L), NowDelinquentDerog = c(0L, 0L, 0L, 0L), WasDelinquentDerog = c(1L,
7L, 0L, 0L), OldestTradeOpenDate = c(5092001L, 5011977L,
12011984L, 4272000L), DelinquenciesOver30Days = c(0L, 6L,
0L, 0L), DelinquenciesOver60Days = c(0L, 4L, 0L, 0L), DelinquenciesOver90Days = c(0L,
10L, 0L, 0L), IsHomeowner = c(FALSE, FALSE, TRUE, FALSE),
LoanStatus = structure(c(2L, 1L, 1L, 2L), .Label = c("0",
"1"), class = "factor")), .Names = c("ListingCategory", "IncomeRange",
"StatedMonthlyIncome", "IncomeVerifiable", "DTIwProsperLoan",
"EmploymentStatusDescription", "MonthsEmployed", "BorrowerState",
"LenderIndicator", "GroupIndicator", "ChannelCode", "MonthlyDebt",
"CurrentDelinquencies", "DelinquenciesLast7Years", "PublicRecordsLast10Years",
"PublicRecordsLast12Months", "CreditLinesLast7Years", "InquiriesLast6Months",
"AmountDelinquent", "CurrentCreditLines", "OpenCreditLines",
"BankcardUtilization", "TotalOpenRevolvingAccounts", "InstallmentBalance",
"RealEstateBalance", "RevolvingBalance", "RealEstatePayment",
"RevolvingAvailablePercent", "TotalInquiries", "TotalTradeItems",
"SatisfactoryAccounts", "NowDelinquentDerog", "WasDelinquentDerog",
"OldestTradeOpenDate", "DelinquenciesOver30Days", "DelinquenciesOver60Days",
"DelinquenciesOver90Days", "IsHomeowner", "LoanStatus"), row.names = c(NA,
4L), class = "data.frame")
Any ideas about what is wrong?
Warning message:
In train.default(x, y, weights = w, ...): The metric "Accuracy" was not in the result set. ROC will be used instead.
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3540.667624
iter 20 value 3329.692768
iter 30 value 3279.191024
iter 40 value 3264.926986
iter 50 value 3259.276647
iter 60 value 3259.056261
final value 3259.032668
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3540.774666
iter 20 value 3330.016829
iter 30 value 3279.545595
iter 40 value 3265.384385
iter 50 value 3259.499032
iter 60 value 3259.353010
final value 3259.342601
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3540.667731
iter 20 value 3329.693092
iter 30 value 3279.191379
iter 40 value 3264.927427
iter 50 value 3259.276899
iter 60 value 3259.056561
final value 3259.032978
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3528.401458
iter 20 value 3314.932958
iter 30 value 3264.117072
iter 40 value 3253.780051
iter 50 value 3253.368959
iter 60 value 3253.359047
final value 3253.358819
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3528.508505
iter 20 value 3315.134599
iter 30 value 3265.021404
iter 40 value 3255.739021
iter 50 value 3253.817833
iter 60 value 3253.697180
final value 3253.671003
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 3528.401565
iter 20 value 3314.933160
iter 30 value 3264.117768
iter 40 value 3253.780539
iter 50 value 3253.369030
iter 60 value 3253.359358
final value 3253.359133
converged
# weights: 71 (70 variable)
initial value 5145.231521
iter 10 value 4680.326236
iter 20 value 4672.506024
iter 30 value 3662.998233
iter 40 value 3310.207744
iter 50 value 3252.983656
iter 60 value 3250.400275
iter 70 value 3250.339216
final value 3250.332646
converged
...
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 4661.569290
iter 20 value 4652.246624
iter 30 value 3715.472355
iter 40 value 3484.096833
iter 50 value 3254.247424
iter 60 value 3248.931841
iter 70 value 3248.154679
iter 80 value 3248.129089
iter 80 value 3248.129085
final value 3248.128574
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 4663.660886
iter 20 value 4654.255466
iter 30 value 3542.473235
iter 40 value 3315.027437
iter 50 value 3250.340679
iter 60 value 3248.693378
iter 70 value 3248.455840
iter 80 value 3248.443345
iter 80 value 3248.443325
iter 80 value 3248.443325
final value 3248.443325
converged
# weights: 72 (71 variable)
initial value 5144.538374
iter 10 value 4661.571382
iter 20 value 4652.248711
iter 30 value 4397.069608
iter 40 value 3532.067046
iter 50 value 3283.179445
iter 60 value 3249.518694
iter 70 value 3248.163057
iter 80 value 3248.129552
final value 3248.128889
converged
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.01805 Min. :0.9946
1st Qu.: NA 1st Qu.:0.01805 1st Qu.:0.9946
Median : NA Median :0.01805 Median :0.9946
Mean :NaN Mean :0.01805 Mean :0.9946
3rd Qu.: NA 3rd Qu.:0.01805 3rd Qu.:0.9946
Max. : NA Max. :0.01805 Max. :0.9946
NA's :3
Error in train.default(x, y, weights = w, ...): Stopping
The summaryFunction = twoClassSummary argument seems to trigger the warning. It also happens here:
ctrl <- trainControl(method = "cv", summaryFunction = twoClassSummary)
multinomSummaryFit <- train(LoanStatus~., credit, method = "multinom", family=binomial,
trControl = ctrl)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Something is wrong; all the ROC metric values are missing:
ROC Sens Spec
Min. : NA Min. :0.01919 Min. :0.9941
1st Qu.: NA 1st Qu.:0.01988 1st Qu.:0.9942
Median : NA Median :0.02056 Median :0.9943
Mean :NaN Mean :0.02011 Mean :0.9943
3rd Qu.: NA 3rd Qu.:0.02056 3rd Qu.:0.9943
Max. : NA Max. :0.02057 Max. :0.9944
NA's :3
Error in train.default(x, y, weights = w, ...): Stopping
Looking at the output of summary(credit), I can see that at least two variables have NA values.
The variable MonthsEmployed has 5 NA values:
MonthsEmployed
Min. :-23.00
1st Qu.: 26.00
Median : 68.00
Mean : 97.44
3rd Qu.:139.00
Max. :755.00
NA's :5
And the variable InstallmentBalance has 328 NA values:
InstallmentBalance
Min. : 0
1st Qu.: 3338
Median : 14453
Mean : 24900
3rd Qu.: 32238
Max. :739371
NA's :328
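Try removing the rows with missing values (or temporarily dropping these two variables) and run the function again to see whether that solves your problem.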
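As a minimal base-R sketch of that check, the snippet below counts the NA values in each column and then keeps only the complete rows. The `toy` data frame is a hypothetical stand-in for `credit`; only its column names are taken from the question.

```r
# Hypothetical stand-in for `credit` (values are illustrative only)
toy <- data.frame(
  MonthsEmployed     = c(4L, NA, 159L, 67L),
  InstallmentBalance = c(48648L, 14827L, NA, 0L),
  LoanStatus         = factor(c("1", "0", "0", "1"))
)

# Number of missing values in each column; show only columns that have any
na_counts <- colSums(is.na(toy))
print(na_counts[na_counts > 0])

# Keep only the rows that have no missing value in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

On the real data the same idea would be credit[complete.cases(credit), ]; na.omit(credit) keeps the same rows.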
In addition, when you use twoClassSummary you need to add metric = "ROC" to the train() call and classProbs = TRUE to trainControl():
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
So your call should be:
multinomSummaryFit <- train(LoanStatus~.,
data = credit,
method = "multinom",
family=binomial,
metric = "ROC",
trControl = ctrl)
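One caveat with classProbs = TRUE: caret needs the class levels to be valid R variable names, because it uses them as column names for the predicted class probabilities, and the levels "0" and "1" of LoanStatus are not valid names. A minimal sketch of checking and renaming the levels, using a small hypothetical factor in place of credit$LoanStatus:

```r
# Hypothetical stand-in for credit$LoanStatus, with the same "0"/"1" levels
loan_status <- factor(c("1", "0", "0", "1"), levels = c("0", "1"))

# A level is a valid R name only if make.names() leaves it unchanged
valid <- levels(loan_status) == make.names(levels(loan_status))
print(valid)

# make.names() turns "0" and "1" into "X0" and "X1"
levels(loan_status) <- make.names(levels(loan_status))
print(levels(loan_status))
```

Descriptive labels would work just as well, as long as they are valid names. Any later argument that names a class, such as positive in confusionMatrix, then has to use the new level names.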
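Another important point about the dataset: you need to check the values of each variable carefully and make sure every value makes sense. For example, the MonthsEmployed variable has negative values. Logically, the number of months employed should not be negative. Are these negative values errors, or do they mean something else (for example, a value of -23 meaning the person has been unemployed for 23 months)?
To answer your question about confusionMatrix:
Suppose your trained model is called multinomSummaryFit. To evaluate the model on a test dataset, call predict on the test dataset without the LoanStatus column (using the same variables the model was trained on), then compare the model's predictions with the actual LoanStatus values. For example: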
# Let's say your test data frame is called test
mymodel_pred <- predict(multinomSummaryFit, test[, names(test) != "LoanStatus"])
Then use confusionMatrix:
# `positive` must be one of the levels of LoanStatus
confusionMatrix(data = mymodel_pred,
reference = test$LoanStatus,
positive = "Default")
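To make the comparison concrete, here is a base-R sketch of the counts behind a confusion matrix and of the sensitivity and specificity derived from them. It is not caret's implementation; the `pred` and `actual` vectors are made-up values, and treating "1" as the positive class is an assumption for illustration only.

```r
# Made-up predictions and true labels (hypothetical, for illustration)
pred   <- factor(c("1", "1", "0", "1", "0", "0"), levels = c("0", "1"))
actual <- factor(c("1", "0", "0", "1", "1", "0"), levels = c("0", "1"))

# Rows are predictions, columns are the true classes
cm <- table(Prediction = pred, Reference = actual)
print(cm)

positive <- "1"  # assumed positive class
negative <- "0"

# Sensitivity: share of truly positive cases predicted as positive
sens <- cm[positive, positive] / sum(cm[, positive])
# Specificity: share of truly negative cases predicted as negative
spec <- cm[negative, negative] / sum(cm[, negative])
print(c(Sens = sens, Spec = spec))
```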
If the test dataset does not have the LoanStatus column, you can simply use:
mymodel_pred <- predict(multinomSummaryFit, test)
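But in that case you cannot evaluate the model on the test dataset, because you do not know the actual responses.
Keep in mind that if you removed any variables from the training dataset, you also need to remove them from the test dataset before calling predict.
Use stratified sampling to split the data into training and test sets: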
trainingRows <- createDataPartition(credit$LoanStatus, p = .70, list= FALSE)
train <- credit[trainingRows, ]
test <- credit[-trainingRows, ]
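To illustrate what "stratified" means here, the base-R sketch below samples 70% of the rows within each class, so both parts keep the original class proportions. This is a hand-rolled illustration, not how createDataPartition is implemented; the vector `y` and the seed are hypothetical.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical imbalanced outcome: 30 cases of "0" and 70 cases of "1"
y <- factor(rep(c("0", "1"), times = c(30, 70)))

# Sample 70% of the row indices separately within each class
idx <- unlist(lapply(split(seq_along(y), y), function(i) {
  sample(i, size = round(0.7 * length(i)))
}))

train_y <- y[idx]
test_y  <- y[-idx]

# Both parts keep the 30/70 class balance of the full data
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```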