[R] randomForest() for regression produces offset predictions
David Katz
david at davidkatzconsulting.com
Fri Dec 21 00:37:52 CET 2007
I would expect this regression-towards-the-mean behavior on a new or hold-out
dataset, not on the training data. In RF terminology, what you want for
prediction is the out-of-bag estimate, whereas predict() called with newdata
returns what amounts to an in-bag estimate. In Joshua's example,
rf.rf$predicted is the out-of-bag estimate, but because newdata is given to
predict(), the result appears to be the in-bag estimate, which still needs an
adjustment like Joshua's (and perhaps a more complex one in some cases). This
is a bit confusing, since predict() usually matches what is in
model$fitted.values. I imagine that is why the author used "predicted" as the
component name instead of the standard "fitted.values".
The documentation for predict.randomForest explains:
"newdata - a data frame or matrix containing new data. (Note: If not given,
the out-of-bag prediction in object is returned.)"
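
A minimal sketch of the distinction, reusing Joshua's swiss model. The seed
value and the variable names oob, refit, mse and slope are my own choices for
illustration; the calls themselves are the ones described in the package
documentation quoted above.

```r
library(randomForest)
data(swiss)

set.seed(1)  # arbitrary seed (my assumption), only for reproducibility
rf.rf <- randomForest(Infant.Mortality ~ ., data = swiss)
actual <- swiss$Infant.Mortality

# No newdata: predict() returns the stored out-of-bag predictions
oob <- predict(rf.rf)

# newdata supplied: every tree contributes, including trees that were
# trained on the row being predicted
refit <- predict(rf.rf, newdata = swiss)

# The out-of-bag predictions are the values stored in the model object
print(all.equal(unname(oob), unname(rf.rf$predicted)))

# Mean squared error against the training response under each scheme
mse <- c(oob = mean((actual - oob)^2), refit = mean((actual - refit)^2))
print(mse)

# Slope of actual regressed on predicted; a slope above 1 means the
# predictions are less spread out than the actual values
slope <- c(oob = unname(coef(lm(actual ~ oob))[2]),
           refit = unname(coef(lm(actual ~ refit))[2]))
print(slope)
```

Comparing the two rows of mse and slope shows how much of the apparent fit
comes from predicting rows the trees have already seen.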
Patrick Burns wrote:
>
> What I see is the predictions being less extreme than the
> actual values -- predictions for large actual values are smaller
> than the actual, and predictions for small actual values are
> larger than the actual. That makes sense to me. The object
> is to maximize out-of-sample predictive power, not in-sample
> predictive power.
>
> Or am I missing something in what you are saying?
>
>
> Patrick Burns
> patrick at burns-stat.com
> +44 (0)20 8525 0696
> http://www.burns-stat.com
> (home of S Poetry and "A Guide for the Unwilling S User")
>
>
> Joshua Knowles wrote:
>
>>Hi all,
>>
>>I have observed that when using the randomForest package to do regression,
>>the predicted values of the dependent variable given by a trained forest are
>>not centred and have the wrong slope when plotted against the true values.
>>
>>This means that the R^2 value obtained by squaring the Pearson correlation
>>is better than that obtained by computing the coefficient of determination
>>directly. The R^2 value obtained by squaring the Pearson correlation can,
>>however, be exactly reproduced by the coefficient of determination if the
>>predicted values are first linearly transformed (using lm() to find the
>>required intercept and slope).
>>
>>Does anyone know why the randomForest behaves in this way - producing offset
>>predictions? Does anyone know a fix for the problem?
>>
>>(By the way, the feature is there even if the original dependent variable
>>values are initially transformed to have zero mean and unit variance.)
>>
>>As an example, here is some simple R code that uses the available swiss
>>dataset to show the effect I am observing.
>>
>>Thanks for any help.
>>
>>--
>>#### EXAMPLE OF RANDOM FOREST REGRESSION
>>
>>library(randomForest)
>>data(swiss)
>>swiss
>>
>>#Build the random forest to predict Infant Mortality
>>rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)
>>
>>#And predict the training set again
>>pred<-c(predict(rf.rf,swiss))
>>actual<-swiss$Infant.Mortality
>>
>>#Plotting predicted against actual values shows the effect (uncomment to
>>#see this)
>>#plot(pred,actual)
>>#abline(0,1)
>>
>># calculate R^2 as pearson coefficient squared
>>R2one<-cor(pred,actual)^2
>>
>># calculate R^2 value as fraction of variance explained
>>residOpt<-(actual-pred)
>>residnone<-(actual-mean(actual))
>>R2two<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>>
>># now fit a line through the predicted and true values and
>># use this to normalize the data before calculating R^2
>>
>>fit<-lm(actual ~ pred)
>>coef(fit)
>>pred2<-pred*coef(fit)[2]+coef(fit)[1]
>>residOpt<-(actual-pred2)
>>R2three<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
>>
>>cat("Pearson squared = ",R2one,"\n")
>>cat("Coeff of determination = ", R2two, "\n")
>>cat("Coeff of determination after linear fitting = ", R2three, "\n")
>>
>>## END
>>
>>
>>
>>
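
The identity Joshua describes can be checked without randomForest at all. The
sketch below uses simulated stand-in values (not output from any forest) in
which the predictions are shrunk towards the mean and shifted by a constant;
the seed, the shrinkage factor 0.5 and the offset 0.8 are arbitrary
assumptions. Note that it computes the coefficient of determination from sums
of squares, whereas Joshua's code uses var() of the residuals, which already
removes any constant offset.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated stand-ins: predictions shrunk towards the mean and offset
actual <- rnorm(100, mean = 20, sd = 3)
pred <- 20 + 0.5 * (actual - 20) + rnorm(100, sd = 1) + 0.8

# Squared Pearson correlation
r2_pearson <- cor(pred, actual)^2

# Coefficient of determination as 1 - SSE/SST; this penalises both
# the wrong slope and the constant offset
sst <- sum((actual - mean(actual))^2)
r2_raw <- 1 - sum((actual - pred)^2) / sst

# Recalibrate with lm(), then recompute 1 - SSE/SST on the fitted values
fit <- lm(actual ~ pred)
r2_recal <- 1 - sum((actual - fitted(fit))^2) / sst

cat("Pearson squared         =", r2_pearson, "\n")
cat("1 - SSE/SST, raw        =", r2_raw, "\n")
cat("1 - SSE/SST, after lm() =", r2_recal, "\n")
```

For a least-squares line with an intercept, the first and third values agree
up to floating-point error, which is why the linear adjustment reproduces the
squared correlation exactly.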