[R] caret: Error when using rpart and CV != LOOCV

Max Kuhn mxkuhn at gmail.com
Thu May 17 04:10:45 CEST 2012


Dominik,

See this line:

>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  30.37   30.37   30.37   30.37   30.37   30.37

The variance of the predictions in that fold is zero. caret computes
R^2 as the squared correlation between the observed data and the
predictions, and the correlation divides by sd(pred), which is zero
here, so R^2 comes out as NA. I believe that the same would occur with
other formulas for R^2.
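
As a minimal sketch of that failure in base R (the observed values and
the constant prediction below are made up for illustration, they are
not taken from your folds):

```r
# Hypothetical hold-out fold: the observed volumes vary, but the tree
# found no split, so every prediction is the same value.
obs  <- c(10.3, 18.8, 24.9, 33.8, 55.4)
pred <- rep(30.5, length(obs))

# The predictions have no spread at all.
sd_pred <- sd(pred)

# Squared correlation between observed and predicted values. cor()
# divides by sd(pred), so the result is NA; R also warns that the
# standard deviation is zero, which is suppressed here.
r2 <- suppressWarnings(cor(obs, pred)^2)

# RMSE does not involve sd(pred), so it is still defined for this fold.
rmse <- sqrt(mean((obs - pred)^2))

print(c(sd_pred = sd_pred, Rsquared = r2, RMSE = rmse))
```

A fold like this yields a missing R^2 but a usable RMSE, which is
consistent with the "missing values in resampled performance measures"
warning.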

Max

On Wed, May 16, 2012 at 11:54 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
> Thanks Max for your answer.
>
> First, I do not understand your post. Why is it a problem if two of
> the predictions match? From the formula for calculating R^2 I can see that
> there will be a DivByZero iff the total sum of squares is 0. This is
> only true if the predictions of all the predicted points from the
> test-set are equal to the mean of the test-set. Why should this happen?
>
> Anyway, I wrote the following code to check what you suggested:
>
> --
> library(caret)
> data(trees)
> formula=Volume~Girth+Height
>
> customSummary <- function (data, lev = NULL, model = NULL) {
>    print(summary(data$pred))
>    return(defaultSummary(data, lev, model))
> }
>
> tc=trainControl(method='cv', summaryFunction=customSummary)
> train(formula, data=trees,  method='rpart', trControl=tc)
> --
>
> This outputs:
> ---
>  Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  18.45   18.45   18.45   30.12   35.95   53.44
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  22.69   22.69   22.69   32.94   38.06   53.44
>   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>  30.37   30.37   30.37   30.37   30.37   30.37
> [cut many values like this]
> Warning: In nominalTrainWorkflow(dat = trainData, info = trainInfo,
> method = method,  :
>  There were missing values in resampled performance measures.
> -----
>
> As I didn't understand your post, I don't know if this confirms your
> assumption.
>
> Thanks anyway,
> Dominik
>
>
> On 16/05/12 17:30, Max Kuhn wrote:
>> More information is needed to be sure, but it is most likely that some
>> of the resampled rpart models produce the same prediction for the
>> hold-out samples (likely the result of no viable split being found).
>>
>> Almost every incarnation of R^2 requires the variance of the
>> prediction. This particular failure mode would result in a divide by
>> zero.
>>
>> Try using your own summary function (see ?trainControl) and put a
>> print(summary(data$pred)) in there to verify my claim.
>>
>> Max
>>
>>> On Tue, May 15, 2012 at 5:55 AM, Dominik Bruhn <dominik at dbruhn.de> wrote:
>>>> Hi,
>>>> I get the following problem when trying to build an rpart model using
>>>> anything but LOOCV. Originally, I wanted to use k-fold partitioning,
>>>> but every partitioning except LOOCV throws the following warning:
>>>>
>>>> ----
>>>> Warning message: In nominalTrainWorkflow(dat = trainData, info =
>>>> trainInfo, method = method, : There were missing values in resampled
>>>> performance measures.
>>>> -----
>>>>
>>>> Below are some simplified testcases which reproduce the warning on my
>>>> system.
>>>>
>>>> Question: What does this warning mean? How can I avoid it?
>>>>
>>>> System-Information:
>>>> -----
>>>>> sessionInfo()
>>>> R version 2.15.0 (2012-03-30)
>>>> Platform: x86_64-pc-linux-gnu (64-bit)
>>>>
>>>> locale:
>>>>  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
>>>>  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
>>>>  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
>>>>  [7] LC_PAPER=C                 LC_NAME=C
>>>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>>
>>>> other attached packages:
>>>> [1] rpart_3.1-52   caret_5.15-023 foreach_1.4.0  cluster_1.14.2 reshape_0.8.4
>>>> [6] plyr_1.7.1     lattice_0.20-6
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] codetools_0.2-8 compiler_2.15.0 grid_2.15.0     iterators_1.0.6
>>>> [5] tools_2.15.0
>>>> -------
>>>>
>>>>
>>>> Simplified Testcase I: Throws warning
>>>> ---
>>>> library(caret)
>>>> data(trees)
>>>> formula=Volume~Girth+Height
>>>> train(formula, data=trees,  method='rpart')
>>>> ---
>>>>
>>>> Simplified Testcase II: Every other CV-method also throws the warning,
>>>> for example using 'cv':
>>>> ---
>>>> library(caret)
>>>> data(trees)
>>>> formula=Volume~Girth+Height
>>>> tc=trainControl(method='cv')
>>>> train(formula, data=trees,  method='rpart', trControl=tc)
>>>> ---
>>>>
>>>> Simplified Testcase III: The only CV-method which is working is 'LOOCV':
>>>> ---
>>>> library(caret)
>>>> data(trees)
>>>> formula=Volume~Girth+Height
>>>> tc=trainControl(method='LOOCV')
>>>> train(formula, data=trees,  method='rpart', trControl=tc)
>>>> ---
>>>>
>>>>
>>>> Thanks!
>>>> --
>>>> Dominik Bruhn
>>>> mailto: dominik at dbruhn.de
>>>>
>>>>
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Max
>>
>>
>>
>
>
> --
> Dominik Bruhn
> mailto: dominik at dbruhn.de
>



-- 

Max


