[R] randomForest() for regression produces offset predictions

David Katz david at davidkatzconsulting.com
Fri Dec 21 00:37:52 CET 2007


I would expect this regression-towards-the-mean behavior on a new or hold-out
dataset, not on the training data. In RF terminology, the value returned by
predict() in Joshua's example is the in-bag estimate, whereas the out-of-bag
estimate is the one that reflects performance on new data. In Joshua's
example, rf.rf$predicted is the out-of-bag estimate, but because newdata is
supplied to predict(), the result appears to be the in-bag estimate, which
still needs an adjustment like Joshua's (and a more complex one might be
needed in some cases). This is a bit confusing, since for most models
predict() matches what is in model$fitted.values. I imagine that is why the
author named the component "predicted" rather than the standard
"fitted.values".
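
To make the role of the linear adjustment concrete, here is a small sketch in
base R. It uses synthetic data rather than the swiss example, and the
"shrunken" predictions are made up purely for illustration. It also uses the
sum-of-squares definition of the coefficient of determination, which is an
assumption on my part and differs from the var() ratio in Joshua's code.

```r
# Synthetic illustration (hypothetical data, not a fitted forest)
set.seed(1)
actual <- rnorm(200, mean = 20, sd = 3)

# Made-up predictions that are shrunk towards the mean, plus a little noise
pred <- mean(actual) + 0.5 * (actual - mean(actual)) + rnorm(200, sd = 0.5)

# Squared Pearson correlation: unchanged by any linear rescaling of pred
r2_pearson <- cor(pred, actual)^2

# Coefficient of determination from sums of squares:
# penalises predictions with the wrong slope or offset
r2_det <- function(obs, fit) {
  1 - sum((obs - fit)^2) / sum((obs - mean(obs))^2)
}
r2_raw <- r2_det(actual, pred)

# Linear recalibration, as in Joshua's lm() step
fit <- lm(actual ~ pred)
r2_recal <- r2_det(actual, fitted(fit))

cat("Pearson squared            =", r2_pearson, "\n")
cat("Coeff of det, raw          =", r2_raw, "\n")
cat("Coeff of det, recalibrated =", r2_recal, "\n")
```

After the lm() step the coefficient of determination should agree with the
squared correlation, since that identity holds for any least-squares line
with an intercept. The raw value can only be equal or lower.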

The documentation for predict.randomForest explains:

"newdata - a data frame or matrix containing new data. (Note: If not given,
the out-of-bag prediction in object is returned.)"
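
A quick way to see the difference is to call predict() both ways on the swiss
data. This is only a sketch: the object names (rf, oob, inbag) and the seed
are my own choices, and I have not shown output because it varies from run to
run and between package versions.

```r
library(randomForest)
data(swiss)

set.seed(42)  # arbitrary seed, only to make the run repeatable
rf <- randomForest(Infant.Mortality ~ ., data = swiss)

# No newdata: the out-of-bag predictions stored in the object are returned
oob <- predict(rf)

# newdata supplied: all trees are used for every row, including trees that
# saw that row during training
inbag <- predict(rf, newdata = swiss)

# The no-newdata result should be the same as the "predicted" component
print(all.equal(oob, rf$predicted))

actual <- swiss$Infant.Mortality
cat("MSE, out-of-bag =", mean((actual - oob)^2), "\n")
cat("MSE, in-bag     =", mean((actual - inbag)^2), "\n")
```

The in-bag error will typically be much smaller than the out-of-bag error,
which is why it should not be read as an estimate of performance on new data.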



Patrick Burns wrote:
> 
> What I see is the predictions being less extreme than the
> actual values -- predictions for large actual values are smaller
> than the actual, and predictions for small actual values are
> larger than the actual.  That makes sense to me.  The object
> is to maximize out-of-sample predictive power, not in-sample
> predictive power.
> 
> Or am I missing something in what you are saying?
> 
> 
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
> 
> 
> Joshua Knowles wrote:
> 
>>Hi all,
>> 
>>I have observed that when using the randomForest package to do regression,
>>the predicted values of the dependent variable given by a trained forest
>>are not centred and have the wrong slope when plotted against the true
>>values.
>> 
>>This means that the R^2 values obtained by squaring the Pearson correlation
>>are better than those obtained by computing the coefficient of
>>determination directly. The R^2 value obtained by squaring the Pearson
>>correlation can, however, be exactly reproduced by the coeff. of det. if
>>the predicted values are first linearly transformed (using lm() to find the
>>required intercept and slope).
>> 
>>Does anyone know why the randomForest behaves in this way - producing
>>offset predictions? Does anyone know a fix for the problem?
>> 
>>(By the way, the feature is there even if the original dependent variable 
>>values are initially transformed to have zero mean and unit variance.)
>> 
>>As an example, here is some simple R code that uses the available swiss 
>>dataset to show the effect I am observing.
>>
>>Thanks for any help.
>> 
>>--
>>#### EXAMPLE OF RANDOM FOREST REGRESSION
>> 
>>library(randomForest)
>>data(swiss)
>>swiss
>> 
>>#Build the random forest to predict Infant Mortality
>>rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)
>> 
>>#And predict the training set again
>>pred<-c(predict(rf.rf,swiss))
>>actual<-swiss$Infant.Mortality
>> 
>>#Plotting predicted against actual values shows the effect (uncomment to
>>#see this)
>>#plot(pred,actual)
>>#abline(0,1)
>> 
>># calculate R^2 as pearson coefficient squared
>>R2one<-cor(pred,actual)^2
>> 
>># calculate R^2 value as fraction of variance explained
>>residOpt<-(actual-pred)
>>residnone<-(actual-mean(actual))
>>R2two<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>> 
>># now fit a line through the predicted and true values and
>># use this to normalize the data before calculating R^2
>> 
>>fit<-lm(actual ~ pred)
>>coef(fit)
>>pred2<-pred*coef(fit)[2]+coef(fit)[1]
>>residOpt<-(actual-pred2)
>>R2three<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>> 
>>cat("Pearson squared = ",R2one,"\n")
>>cat("Coeff of determination = ", R2two, "\n")
>>cat("Coeff of determination after linear fitting = ", R2three, "\n")
>> 
>>## END
>> 
