[R] Prediction accuracy from Bagging with continuous data

Fri Feb 11 01:49:56 CET 2011

On Thu, Feb 10, 2011 at 8:45 AM, Simon Gillings <simon.gillings at bto.org> wrote:
> I am using bagging to perform Bagged Regression Trees on count data (bird abundance in Britain and Ireland, in relation to climate and land cover variables). Predictions from the final model are visually believable but I would really like a diagnostic equivalent to classification success that can be used to decide if a model is adequate. Whereas with classification data an error rate is returned, with continuous data only the root mean squared error is returned. The RMSE is helpful for comparing different models for the same species and deciding which is best, but as far as I can tell it offers no absolute measure of how good that best model is.
>
> At present I am using the final model to make predictions for the original dataset and then computing a correlation coefficient between observed and predicted values, but I expect this is biased high because the model has already seen those observations. Ideally, I think I need the correlation coefficient between the predicted and observed values for the out-of-bag sample of each of the n trees, but I don't see this reported anywhere.
>
> Does anyone know of a means of getting a useful unbiased diagnostic for assessing overall fit?
>

I'm not sure this suggestion will help, but you could switch to a
Random Forest ensemble of regression trees (package randomForest).
The fitted forest automatically stores, for each observation, a
prediction computed only from the trees for which that observation
was out-of-bag, which gives you a basis for an unbiased estimate of
accuracy.
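
To make the out-of-bag idea concrete, here is a minimal sketch in plain
Python (not R, and not how randomForest is implemented internally).
Everything in it is an assumption for illustration: the data are
simulated, the base learner is a one-split "stump" rather than a full
regression tree, and the names fit_stump, predict_stump, bagged_oob and
pearson are made up. Each observation is predicted only by the trees
whose bootstrap sample did not contain it, and those predictions are
then compared with the observed values.

```python
import math
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor (illustrative base learner)."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    best_score, threshold = -1.0, None
    left_mean = right_mean = total / n
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        # Maximising this score is equivalent to minimising the within-node SSE.
        score = left_sum ** 2 / i + right_sum ** 2 / (n - i)
        if score > best_score:
            best_score = score
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            left_mean, right_mean = left_sum / i, right_sum / (n - i)
    return threshold, left_mean, right_mean


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    if threshold is None:  # no valid split was found; predict the overall mean
        return left_mean
    return left_mean if x <= threshold else right_mean


def bagged_oob(xs, ys, n_trees=200, seed=0):
    """Bag stumps and return, per observation, the mean prediction over
    only those trees whose bootstrap sample left that observation out."""
    rng = random.Random(seed)
    n = len(xs)
    oob_sum, oob_count = [0.0] * n, [0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        for i in range(n):
            if i not in in_bag:  # observation i is out-of-bag for this tree
                oob_sum[i] += predict_stump(stump, xs[i])
                oob_count[i] += 1
    # None marks an observation that was never out-of-bag (rare with many trees).
    return [s / c if c else None for s, c in zip(oob_sum, oob_count)]


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Simulated count-like data (hypothetical; stands in for the real abundance data).
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [max(0, round(2 * x + rng.gauss(0, 3))) for x in xs]

oob_pred = bagged_oob(xs, ys)
kept = [(y, p) for y, p in zip(ys, oob_pred) if p is not None]
obs = [y for y, _ in kept]
pred = [p for _, p in kept]

r_oob = pearson(obs, pred)
sse = sum((y - p) ** 2 for y, p in kept)
sst = sum((y - sum(obs) / len(obs)) ** 2 for y in obs)
rmse_oob = math.sqrt(sse / len(kept))
pseudo_r2 = 1 - sse / sst  # share of variance explained on out-of-bag data

print("OOB correlation:", round(r_oob, 3))
print("OOB RMSE:", round(rmse_oob, 3))
print("OOB pseudo R-squared:", round(pseudo_r2, 3))
```

Unlike the RMSE alone, the out-of-bag correlation and the pseudo
R-squared (1 - SSE/SST) are on a scale that does not depend on the
units of the response, which is the kind of absolute measure asked for
above.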

Peter