[R] Inconsistent results between caret+kernlab versions
Max Kuhn
mxkuhn at gmail.com
Fri Nov 15 21:59:51 CET 2013
Or not! The issue is with kernlab.
Background: SVM models do not naturally produce class probabilities. A
secondary model (via Platt scaling) is fit to the raw model output, and a
logistic function is used to translate the raw SVM output into
probability-like numbers (i.e. between 0 and 1, summing to one). In
ksvm(), you need to use the option prob.model = TRUE to get that
second model.
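To make the idea concrete, here is a minimal sketch of that secondary logistic (Platt-style) model, using made-up decision values and labels (the real kernlab implementation is more involved; this only illustrates the logistic mapping):

```r
# Hypothetical illustration of Platt-style scaling: fit a logistic model
# to raw SVM decision values, then map new values to (0, 1) probabilities.
dec_vals <- c(-2.1, -1.3, 0.3, -0.2, 0.9, 1.7)          # made-up decision values
labels   <- factor(c("no", "no", "no", "yes", "yes", "yes"))

platt <- glm(labels ~ dec_vals, family = binomial())     # logistic fit
probs <- predict(platt, data.frame(dec_vals = c(-1, 0, 1)),
                 type = "response")                      # probability-like output
```

The fitted probabilities increase monotonically with the decision value and always lie strictly between 0 and 1.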
I discovered some time ago that there can be a discrepancy between the
predicted classes that come directly from the SVM model and those
derived by using the class associated with the largest class
probability. This is most likely due to natural error in the secondary
probability model and is not unexpected.
That is the case for your data. If you use the same tuning parameters
as those suggested by train() and go straight to ksvm():
> newSVM <- ksvm(x = as.matrix(df[,-1]),
+ y = df[,1],
+ kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
+ C = svm.m1$bestTune$.C,
+ prob.model = TRUE)
>
> predict(newSVM, df[43,-1])
[1] O32078
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
> predict(newSVM, df[43,-1], type = "probabilities")
O27479 O31403 O32057 O32059 O32060 O32078
[1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
O32089 O32663 O32668 O32676
[1,] 0.04890477 0.05210836 0.09838892 0.07284396
Note that, based on the probability model, the class with the largest
probability is O32057 (p = 0.24) while the basic SVM model predicts
O32078 (p = 0.16).
Somebody (maybe me) saw this discrepancy, and that led me to follow this rule:
if (prob.model == TRUE) use the class with the maximum probability
else use the class prediction from ksvm().
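That rule amounts to taking the argmax of the probability vector. A sketch of how that plays out for row 43 above, using the (rounded) probabilities from the output:

```r
# Sketch of the "use the class with maximum probability" rule, with the
# rounded probabilities reported for row 43 of the example data.
probs <- c(O27479 = 0.088, O31403 = 0.059, O32057 = 0.242,
           O32059 = 0.104, O32060 = 0.070, O32078 = 0.165)

pred_class <- names(probs)[which.max(probs)]
pred_class  # "O32057", not the "O32078" that ksvm() predicts directly
```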
Therefore:
> predict(svm.m1, df[43,-1])
[1] O32057
10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
That change occurred between the two caret versions that you tested with.
(On a side note, this can also occur with ksvm() and rpart() if
cost-sensitive training is used, because the class designation takes
the costs into account but the class probability predictions do not. I
alerted both package maintainers to the issue some time ago.)
HTH,
Max
On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn <mxkuhn at gmail.com> wrote:
> I've looked into this a bit and the issue seems to be with caret. I've
> been looking at the svn check-ins and nothing stands out to me as the
> issue so far. The final models that are generated are the same and
> I'll try to figure out the difference.
>
> Two small notes:
>
> 1) you should set the seed to ensure reproducibility.
> 2) you really shouldn't use character strings that are all numbers as
> factor levels with caret when you want class probabilities. It should
> give you a warning about this.
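On note 1, a minimal sketch of why fixing the seed matters: caret's resampling indices are drawn from R's RNG, so the same seed yields the same resamples (sample() here just stands in for that draw):

```r
# Fixing the seed before train() makes the resampling indices, and hence
# the cross-validation results, reproducible across runs.
set.seed(123)            # 123 is an arbitrary choice
idx1 <- sample(10)       # stands in for caret's resampling draws
set.seed(123)
idx2 <- sample(10)
identical(idx1, idx2)    # TRUE: same seed, same resamples
```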
>
> Max
>
> On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby <andrewdigby at mac.com> wrote:
>>
>> I'm using caret to assess classifier performance (and it's great!). However, I've found that my results differ between R2.* and R3.* - reported accuracies are reduced dramatically. I suspect that a code change to kernlab ksvm may be responsible (see version 5.16-24 here: http://cran.r-project.org/web/packages/caret/news.html). I get very different results between caret_5.15-61 + kernlab_0.9-17 and caret_5.17-7 + kernlab_0.9-19 (see below).
>>
>> Can anyone please shed any light on this?
>>
>> Thanks very much!
>>
>>
>> ### To replicate:
>>
>> require(repmis) # For downloading from https
>> df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv', sep=',')
>> require(caret)
>> svm.m1 <- train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tuneLength=5,trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
>> svm.m1
>> sessionInfo()
>>
>> ### Results - R2.15.2
>>
>>> svm.m1
>> 1241 samples
>> 7 predictors
>> 10 classes: 'O27479', 'O31403', 'O32057', 'O32059', 'O32060', 'O32078', 'O32089', 'O32663', 'O32668', 'O32676'
>>
>> No pre-processing
>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>
>> Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...
>>
>> Resampling results across tuning parameters:
>>
>> C Accuracy Kappa Accuracy SD Kappa SD
>> 0.25 0.684 0.63 0.0353 0.0416
>> 0.5 0.729 0.685 0.0379 0.0445
>> 1 0.756 0.716 0.0357 0.0418
>>
>> Tuning parameter 'sigma' was held constant at a value of 0.247
>> Kappa was used to select the optimal model using the largest value.
>> The final values used for the model were C = 1 and sigma = 0.247.
>>> sessionInfo()
>> R version 2.15.2 (2012-10-26)
>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>
>> locale:
>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] e1071_1.6-1 class_7.3-5 kernlab_0.9-17 repmis_0.2.4 caret_5.15-61 reshape2_1.2.2 plyr_1.8 lattice_0.20-10 foreach_1.4.0 cluster_1.14.3
>>
>> loaded via a namespace (and not attached):
>> [1] codetools_0.2-8 compiler_2.15.2 digest_0.6.0 evaluate_0.4.3 formatR_0.7 grid_2.15.2 httr_0.2 iterators_1.0.6 knitr_1.1 RCurl_1.95-4.1 stringr_0.6.2 tools_2.15.2
>>
>> ### Results - R3.0.2
>>
>>> require(caret)
>>> svm.m1 <- train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tuneLength=5,trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
>> Loading required package: class
>> Warning messages:
>> 1: closing unused connection 4 (https://dl.dropboxusercontent.com/u/47973221/df.Rdata)
>> 2: executing %dopar% sequentially: no parallel backend registered
>>> svm.m1
>> 1241 samples
>> 7 predictors
>> 10 classes: 'O27479', 'O31403', 'O32057', 'O32059', 'O32060', 'O32078', 'O32089', 'O32663', 'O32668', 'O32676'
>>
>> No pre-processing
>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>
>> Summary of sample sizes: 1118, 1117, 1115, 1117, 1116, 1118, ...
>>
>> Resampling results across tuning parameters:
>>
>> C Accuracy Kappa Accuracy SD Kappa SD
>> 0.25 0.372 0.278 0.033 0.0371
>> 0.5 0.39 0.297 0.0317 0.0358
>> 1 0.399 0.307 0.0289 0.0323
>>
>> Tuning parameter 'sigma' was held constant at a value of 0.2148907
>> Kappa was used to select the optimal model using the largest value.
>> The final values used for the model were C = 1 and sigma = 0.215.
>>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>
>> locale:
>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] e1071_1.6-1 class_7.3-9 kernlab_0.9-19 repmis_0.2.6.2 caret_5.17-7 reshape2_1.2.2 plyr_1.8 lattice_0.20-24 foreach_1.4.1 cluster_1.14.4
>>
>> loaded via a namespace (and not attached):
>> [1] codetools_0.2-8 compiler_3.0.2 digest_0.6.3 grid_3.0.2 httr_0.2 iterators_1.0.6 RCurl_1.95-4.1 stringr_0.6.2 tools_3.0.2
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
>
> Max
--
Max