[R] Inconsistent results between caret+kernlab versions
Andrew Digby
andrewdigby at mac.com
Mon Nov 18 02:23:21 CET 2013
Hi Max,
Thanks very much for investigating and explaining that - your help and time are much appreciated.
So as I understand it, using classProbs=F in trainControl() will give me the same accuracy results as before. However, I was relying on the class probabilities to return ROC/sensitivity/specificity, using a custom function similar to twoClassSummary().
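For reference, a summary function along those lines can be written with base R alone. This is only a sketch under assumed conventions (caret passes a data frame with columns obs and pred plus one probability column per class level in lev), and it handles only the two-class case:

```r
# Sketch of a twoClassSummary()-style function using only base R.
# Assumes caret's convention: `data` has columns obs (truth), pred
# (predicted class), and one probability column per level in `lev`.
customSummary <- function(data, lev = NULL, model = NULL) {
  pos <- data$obs == lev[1]
  # AUC via the Mann-Whitney rank statistic on the first level's probabilities
  r <- rank(data[, lev[1]])
  auc <- (sum(r[pos]) - sum(pos) * (sum(pos) + 1) / 2) /
    (sum(pos) * sum(!pos))
  c(ROC  = auc,
    Sens = mean(data$pred[pos]  == lev[1]),   # sensitivity for lev[1]
    Spec = mean(data$pred[!pos] == lev[2]))   # specificity (lev[2] correct)
}
```

It would be passed to trainControl() via summaryFunction in the usual way.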
What I still don't quite understand is which accuracy values from train() I should trust: those with classProbs=TRUE or with classProbs=FALSE? I'm using train() to compare different classification methods on several statistics (accuracy, AUROC etc.), and this change means that SVM suddenly looks much worse on accuracy! I take it this means I should roll back to the earlier versions of caret and kernlab (which is a pain, because train() then often crashes with 'memory map' errors)?
Thanks,
Andrew
On 16/11/2013, at 09:59 , Max Kuhn <mxkuhn at gmail.com> wrote:
> Or not!
>
> The issue is with kernlab.
>
> Background: SVM models do not naturally produce class probabilities. A
> secondary model (via Platt scaling) is fit to the raw model output, and a
> logistic function is used to translate the raw SVM output into
> probability-like numbers (i.e. between 0 and 1, summing to one). In
> ksvm(), you need to use the option prob.model = TRUE to get that
> second model.
>
> I discovered some time ago that there can be a discrepancy between the
> predicted classes that come directly from the SVM model and those
> derived by taking the class with the largest class
> probability. This is most likely due to natural error in the secondary
> probability model and should not be unexpected.
>
> That is the case for your data. If you use the same tuning parameters
> as those suggested by train() and go straight to ksvm():
>
>> newSVM <- ksvm(x = as.matrix(df[,-1]),
> + y = df[,1],
> + kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
> + C = svm.m1$bestTune$.C,
> + prob.model = TRUE)
>>
>> predict(newSVM, df[43,-1])
> [1] O32078
> 10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
>> predict(newSVM, df[43,-1], type = "probabilities")
> O27479 O31403 O32057 O32059 O32060 O32078
> [1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
> O32089 O32663 O32668 O32676
> [1,] 0.04890477 0.05210836 0.09838892 0.07284396
>
> Note that, based on the probability model, the class with the largest
> probability is O32057 (p = 0.24) while the basic SVM model predicts
> O32078 (p = 0.16).
>
> Somebody (maybe me) saw this discrepancy, and that led me to follow this rule:
>
> if(prob.model = TRUE) use the class with the maximum probability
> else use the class prediction from ksvm().
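> In R terms, the first branch of that rule amounts to taking the column-wise maximum of the probability matrix. A sketch (hypothetical helper, not caret's actual internals):

```r
# Sketch of the "use the class with the maximum probability" branch.
# `probs` is a matrix of class probabilities: one row per sample, one
# column per class, as returned by predict(..., type = "probabilities").
classFromProbs <- function(probs) {
  factor(colnames(probs)[max.col(probs, ties.method = "first")],
         levels = colnames(probs))
}
```

> With prob.model = TRUE this would be applied to predict(fit, newdata, type = "probabilities"); otherwise the class prediction comes from predict(fit, newdata) directly.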
>
> Therefore:
>
>> predict(svm.m1, df[43,-1])
> [1] O32057
> 10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
>
> That change occurred between the two caret versions that you tested with.
>
> (On a side note, this can also occur with ksvm() and rpart() if
> cost-sensitive training is used, because the class designation takes
> the costs into account but the class probability predictions do not. I
> alerted both package maintainers to the issue some time ago.)
>
> HTH,
>
> Max
>
> On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn <mxkuhn at gmail.com> wrote:
>> I've looked into this a bit and the issue seems to be with caret. I've
>> been looking at the svn check-ins and nothing stands out to me as the
>> cause so far. The final models that are generated are the same, and
>> I'll try to figure out where the difference arises.
>>
>> Two small notes:
>>
>> 1) you should set the seed to ensure reproducibility.
>> 2) you really shouldn't use character strings that are all numbers as
>> factor levels with caret when you want class probabilities. It should
>> give you a warning about this.
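>> Both points in miniature (variable names hypothetical):

```r
set.seed(123)   # before train(): makes the resampling splits reproducible

# caret wants factor levels that are valid R variable names when
# classProbs = TRUE; make.names() fixes all-numeric labels.
y <- factor(c("1", "2", "1"))
levels(y) <- make.names(levels(y))   # "1" -> "X1", "2" -> "X2"
```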
>>
>> Max
>>
>> On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby <andrewdigby at mac.com> wrote:
>>>
>>> I'm using caret to assess classifier performance (and it's great!). However, I've found that my results differ between R2.* and R3.* - reported accuracies are reduced dramatically. I suspect that a code change to kernlab ksvm may be responsible (see version 5.16-24 here: http://cran.r-project.org/web/packages/caret/news.html). I get very different results between caret_5.15-61 + kernlab_0.9-17 and caret_5.17-7 + kernlab_0.9-19 (see below).
>>>
>>> Can anyone please shed any light on this?
>>>
>>> Thanks very much!
>>>
>>>
>>> ### To replicate:
>>>
>>> require(repmis) # For downloading from https
>>> df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv', sep=',')
>>> require(caret)
>>> svm.m1 <- train(df[,-1], df[,1], method='svmRadial', metric='Kappa', tuneLength=5, trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
>>> svm.m1
>>> sessionInfo()
>>>
>>> ### Results - R2.15.2
>>>
>>>> svm.m1
>>> 1241 samples
>>> 7 predictors
>>> 10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>>
>>> No pre-processing
>>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>>
>>> Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...
>>>
>>> Resampling results across tuning parameters:
>>>
>>> C Accuracy Kappa Accuracy SD Kappa SD
>>> 0.25 0.684 0.63 0.0353 0.0416
>>> 0.5 0.729 0.685 0.0379 0.0445
>>> 1 0.756 0.716 0.0357 0.0418
>>>
>>> Tuning parameter ‘sigma’ was held constant at a value of 0.247
>>> Kappa was used to select the optimal model using the largest value.
>>> The final values used for the model were C = 1 and sigma = 0.247.
>>>> sessionInfo()
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] e1071_1.6-1 class_7.3-5 kernlab_0.9-17 repmis_0.2.4 caret_5.15-61 reshape2_1.2.2 plyr_1.8 lattice_0.20-10 foreach_1.4.0 cluster_1.14.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] codetools_0.2-8 compiler_2.15.2 digest_0.6.0 evaluate_0.4.3 formatR_0.7 grid_2.15.2 httr_0.2 iterators_1.0.6 knitr_1.1 RCurl_1.95-4.1 stringr_0.6.2 tools_2.15.2
>>>
>>> ### Results - R3.0.2
>>>
>>>> require(caret)
>>>> svm.m1 <- train(df[,-1], df[,1], method='svmRadial', metric='Kappa', tuneLength=5, trControl=trainControl(method='repeatedcv', number=10, repeats=10, classProbs=TRUE))
>>> Loading required package: class
>>> Warning messages:
>>> 1: closing unused connection 4 (https://dl.dropboxusercontent.com/u/47973221/df.Rdata)
>>> 2: executing %dopar% sequentially: no parallel backend registered
>>>> svm.m1
>>> 1241 samples
>>> 7 predictors
>>> 10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’, ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>>
>>> No pre-processing
>>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>>
>>> Summary of sample sizes: 1118, 1117, 1115, 1117, 1116, 1118, ...
>>>
>>> Resampling results across tuning parameters:
>>>
>>> C Accuracy Kappa Accuracy SD Kappa SD
>>> 0.25 0.372 0.278 0.033 0.0371
>>> 0.5 0.39 0.297 0.0317 0.0358
>>> 1 0.399 0.307 0.0289 0.0323
>>>
>>> Tuning parameter ‘sigma’ was held constant at a value of 0.2148907
>>> Kappa was used to select the optimal model using the largest value.
>>> The final values used for the model were C = 1 and sigma = 0.215.
>>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] e1071_1.6-1 class_7.3-9 kernlab_0.9-19 repmis_0.2.6.2 caret_5.17-7 reshape2_1.2.2 plyr_1.8 lattice_0.20-24 foreach_1.4.1 cluster_1.14.4
>>>
>>> loaded via a namespace (and not attached):
>>> [1] codetools_0.2-8 compiler_3.0.2 digest_0.6.3 grid_3.0.2 httr_0.2 iterators_1.0.6 RCurl_1.95-4.1 stringr_0.6.2 tools_3.0.2
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>>
>> Max
>
>
>
> --
>
> Max