[R] Question about randomForest

Sun Nov 27 09:21:01 CET 2011

Thanks for the help. Let me explain in more detail how I think that
randomForest works so that you (or others) can more easily see the
error of my ways.

The function first takes a random sample of the data, of the size
specified by the sampsize argument. With this it fully grows a tree
resulting in a horribly over-fitted classifier for the random sub-set.
It then repeats this again with a different sample to generate the
next tree and so on.
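
As a rough check on the sampling step described above, here is a small base-R sketch. It does not call randomForest at all; the sample size n and the number of repetitions are arbitrary illustrative choices. As far as I know, the default is to draw nrow(x) rows with replacement unless sampsize or replace are changed, so a fair share of rows never enters a given tree's sample:

```r
set.seed(1)
n <- 1000  # arbitrary illustrative sample size (an assumption, not from the post)

# Fraction of distinct rows that land in one bootstrap sample of size n
in_bag_fraction <- function(n) {
  idx <- sample(n, size = n, replace = TRUE)
  length(unique(idx)) / n
}

frac <- replicate(200, in_bag_fraction(n))
mean(frac)      # close to 1 - exp(-1), i.e. roughly 0.63
1 - mean(frac)  # share of rows left out of bag for a typical tree
```

The rows that a tree never saw are the ones it can be tested on without re-using its own training sample.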

Now, my understanding is that after each tree is constructed, a test
prediction for the *whole* training data set is made by combining the
results of all trees (e.g. for classification, the majority vote over
all individual tree predictions). From this an error rate is
determined (applicable to the ensemble applied to the training data)
and reported in the err.rate member of the returned randomForest
object. If you look at the error rate (or plot it using the default
plot method) you see that it starts out very high when only 1 or a few
over-fitted trees are contributing, but once the forest gets larger
the error rate drops since the ensemble is doing its job. It doesn't
make sense to me that this error rate is for a sub-set of the data,
since the sub-set in question changes at each step (i.e. at each tree
construction)?
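
One way to inspect what err.rate actually holds is sketched below, using the built-in iris data as a stand-in for the real training set (the data set, seed and ntree are my assumptions). For classification the first column of err.rate is labelled "OOB", which suggests that each row is computed from out-of-bag cases only, in line with the reply quoted further down:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

colnames(rf$err.rate)        # "OOB" first, then one column per class
head(rf$err.rate[, "OOB"])   # error after 1, 2, 3, ... trees

# Last row of err.rate versus the error of the stored predictions
final_oob  <- rf$err.rate[rf$ntree, "OOB"]
stored_err <- mean(rf$predicted != iris$Species, na.rm = TRUE)
c(final_oob = final_oob, stored_err = stored_err)
```

If the two numbers agree, the stored predictions and err.rate describe the same out-of-bag quantity rather than a prediction made by all trees on all rows.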

By running a cross-validation test, splitting the data I have into
'training' and 'test' sets, I do find that the error rates on the test
sets are comparable to the error rate obtained from the predicted
member of the returned randomForest object. So that does seem to be
the 'correct' error.
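
A minimal sketch of that kind of holdout comparison, again on iris as a stand-in (the split size, seed and ntree are arbitrary assumptions):

```r
library(randomForest)

set.seed(7)
test_idx <- sample(nrow(iris), 50)   # arbitrary one-third holdout
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

rf <- randomForest(Species ~ ., data = train, ntree = 200)

# Error of the stored predictions versus error on rows the forest never saw
oob_err  <- mean(rf$predicted != train$Species, na.rm = TRUE)
test_err <- mean(predict(rf, newdata = test) != test$Species)
c(oob = oob_err, test = test_err)
```

The exact numbers depend on the seed and the split, so only their rough agreement is meaningful.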

By my understanding the error reported for the ith tree is that
obtained using all trees up to and including the ith tree to make an
ensemble prediction. Therefore the final error reported should be the
same as that obtained using the predict.randomForest function on the
training set, because by my understanding that should return an
identical result to that used to generate the error rate for the final
tree constructed?
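
The comparison in question can be reproduced with a short sketch like the one below (iris, the seed and ntree are assumptions for illustration). Calling predict() without newdata returns the stored out-of-bag predictions, whereas passing the training data as newdata lets every tree vote on every row, including rows that tree was grown on:

```r
library(randomForest)

set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

oob_pred   <- predict(rf)                  # no newdata: stored OOB predictions
resub_pred <- predict(rf, newdata = iris)  # every tree votes on every row

all(oob_pred == rf$predicted)     # same object as the stored predictions

mean(oob_pred   != iris$Species)  # OOB error
mean(resub_pred != iris$Species)  # error on rows the trees have already seen
```

Because fully grown trees nearly memorise their own samples, the second error is typically much lower and is not a fair estimate of performance on new data.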

Sorry that is a bit long-winded, but I hope someone can point out
where I'm going wrong and set me straight.

Thanks!

On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu <anopheles123 at gmail.com> wrote:
> Hi Matthew,
>
> The error rate reported by randomForest is the prediction error based
> on out-of-bag (OOB) data. Therefore, it is different from the prediction
> error on the original data, since each tree was built using a bootstrap
> sample (about 63% of the distinct original observations), and the OOB
> error rate is likely higher than the prediction error on the original
> data, as you observed.
>
> Weidong
>
> On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
> <mattjamesfrancis at gmail.com> wrote:
>> I've been using the R package randomForest but there is an aspect I
>> cannot work out the meaning of. After calling the randomForest
>> function, the returned object contains an element called predicted,
>> which is the prediction obtained using all the trees (at least that's
>> my understanding). I've checked that this prediction set has the error
>> rate as reported by err.rate.
>>
>> However, if I send the training data back into the
>> predict.randomForest function I find I get a different result to the
>> stored set of predictions. This is true for both classification and
>> regression. I find the predictions obtained this way also have a much
>> lower error rate and perform very well (suspiciously well...) on
>> measures such as AUC.
>>
>> My understanding is that the two predictions above should be the same.
>> Since they are not, I must not be understanding something properly.
>> Any ideas what's going on?
>>