[R] caret: Error when using rpart and CV != LOOCV
Max Kuhn
mxkuhn at gmail.com
Thu May 17 04:10:45 CEST 2012
Dominik,
See this line:
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 30.37 30.37 30.37 30.37 30.37 30.37
The variance of the predictions is zero. caret computes R^2 as the
squared correlation between the observed data and the predictions.
That calculation uses sd(pred), which is zero here, so the result is
NA. I believe that the same would occur with other formulas for R^2.
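
To make the failure concrete, here is a minimal base-R sketch of what a
correlation-based R^2 gives when every hold-out prediction is identical.
It does not call caret, and the obs and pred values are made up for
illustration, so treat it as a sketch rather than caret's actual code:

```r
# Hypothetical hold-out fold: the observed values vary, but the tree found
# no split, so every prediction is the same constant (values are invented).
obs  <- c(10.3, 18.8, 24.2, 31.7, 55.4)
pred <- rep(30.375, length(obs))

# The standard deviation of a constant vector is zero ...
sd_pred <- sd(pred)

# ... so the correlation is undefined: cor() returns NA (with a warning),
# and squaring NA still gives NA.
r2 <- suppressWarnings(cor(obs, pred)^2)

# RMSE does not involve sd(pred), so it remains a finite number.
rmse <- sqrt(mean((obs - pred)^2))

print(c(sd_pred = sd_pred, Rsquared = r2, RMSE = rmse))
```

An NA like this in even one resample appears to be what produces the
"missing values in resampled performance measures" warning.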
Max
On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
> Thanks Max for your answer.
>
> First, I do not understand your post. Why is it a problem if two of the
> predictions match? From the formula for calculating R^2 I can see that
> there will be a DivByZero iff the total sum of squares is 0. This is
> only true if the predictions of all the predicted points from the
> test-set are equal to the mean of the test-set. Why should this happen?
>
> Anyway, I wrote the following code to check what you described:
>
> --
> library(caret)
> data(trees)
> formula=Volume~Girth+Height
>
> customSummary <- function (data, lev = NULL, model = NULL) {
> print(summary(data$pred))
> return(defaultSummary(data, lev, model))
> }
>
> tc=trainControl(method='cv', summaryFunction=customSummary)
> train(formula, data=trees, method='rpart', trControl=tc)
> --
>
> This outputs:
> ---
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 18.45 18.45 18.45 30.12 35.95 53.44
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 22.69 22.69 22.69 32.94 38.06 53.44
> Min. 1st Qu. Median Mean 3rd Qu. Max.
> 30.37 30.37 30.37 30.37 30.37 30.37
> [cut many values like this]
> Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
> method = method, :
> There were missing values in resampled performance measures.
> -----
>
> As I didn't understand your post, I don't know if this confirms your
> assumption.
>
> Thanks anyway,
> Dominik
>
>
> On 16/05/12 17:30, Max Kuhn wrote:
>> More information is needed to be sure, but it is most likely that some
>> of the resampled rpart models produce the same prediction for the
>> hold-out samples (likely the result of no viable split being found).
>>
>> Almost every incarnation of R^2 requires the variance of the
>> prediction. This particular failure mode would result in a divide by
>> zero.
>>
>> Try using your own summary function (see ?trainControl) and put a
>> print(summary(data$pred)) in there to verify my claim.
>>
>> Max
>>
>>> On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
>>>> Hi,
>>>> I ran into the following problem when trying to build an rpart model
>>>> with any resampling method other than LOOCV. Originally, I wanted to
>>>> use k-fold partitioning, but every partitioning except LOOCV throws
>>>> the following warning:
>>>>
>>>> ----
>>>> Warning message: In nominalTrainWorkflow(dat = trainData, info =
>>>> trainInfo, method = method, : There were missing values in resampled
>>>> performance measures.
>>>> -----
>>>>
>>>> Below are some simplified test cases which reproduce the warning on my
>>>> system.
>>>>
>>>> Question: What does this error mean? How can I avoid it?
>>>>
>>>> System-Information:
>>>> -----
>>>>> sessionInfo()
>>>> R version 2.15.0 (2012-03-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
>>>> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
>>>> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
>>>> [7] LC_PAPER=C LC_NAME=C
>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats graphics grDevices utils datasets methods base
>>>>
>>>> other attached packages:
>>>> [1] rpart_3.1-52 caret_5.15-023 foreach_1.4.0 cluster_1.14.2 reshape_0.8.4
>>>> [6] plyr_1.7.1 lattice_0.20-6
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0 iterators_1.0.6
>>>> [5] tools_2.15.0
>>>> -------
>>>>
>>>>
>>>> Simplified Test Case I: Throws warning
>>>> ---
>>>> library(caret)
>>>> data(trees)
>>>> formula=Volume~Girth+Height
>>>> train(formula, data=trees, method='rpart')
>>>> ---
>>>>
>>>> Simplified Test Case II: Every other CV method also throws the warning,
>>>> for example using 'cv':
>>>> ---
>>>> library(caret)
>>>> data(trees)
>>>> formula=Volume~Girth+Height
>>>> tc=trainControl(method='cv')
>>>> train(formula, data=trees, method='rpart', trControl=tc)
>>>> ---
>>>>
>>>> Simplified Test Case III: The only CV method that works is 'LOOCV':
>>>> ---
>>>> library(caret)
>>>> data(trees)
>>>> formula=Volume~Girth+Height
>>>> tc=trainControl(method='LOOCV')
>>>> train(formula, data=trees, method='rpart', trControl=tc)
>>>> ---
>>>>
>>>>
>>>> Thanks!
>>>> --
>>>> Dominik Bruhn
>>>> mailto: dominik at dbruhn.de
>>>>
>>>>
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Max
>>
>>
>>
>
>
> --
> Dominik Bruhn
> mailto: dominik at dbruhn.de
>
--
Max