[R] Question about randomForest

Liaw, Andy andy_liaw at merck.com
Mon Nov 28 16:53:55 CET 2011


Not only that, but in the same help page, same "Value" section, it says:

predicted: the predicted values of the input data based on out-of-bag samples
 
so people really should read the help pages instead of speculating...
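
A minimal sketch of the distinction, assuming the randomForest package is installed. The iris data, the seed, and ntree = 200 are illustrative choices, not taken from this thread. It compares the stored `predicted` component (OOB votes only) with what predict() returns when the training data are passed back in as newdata:

```r
# Assumption: randomForest is installed; data set and seed are illustrative.
library(randomForest)

set.seed(42)
x <- iris[, 1:4]
y <- iris$Species
ntree <- 200
rf <- randomForest(x, y, ntree = ntree)

# Stored predictions: each case is voted on only by trees for which it was OOB.
oob_pred <- rf$predicted

# Calling predict() without newdata returns those same OOB predictions.
same_as_stored <- all(predict(rf) == oob_pred)

# Passing the training data back in lets every tree vote, including the
# trees that saw the case while they were being grown.
resub_pred <- predict(rf, newdata = x)

oob_error   <- mean(oob_pred != y)
resub_error <- mean(resub_pred != y)

# The last row of err.rate is the OOB error of the full forest.
final_err_rate <- rf$err.rate[ntree, "OOB"]

print(c(oob = oob_error, err_rate = final_err_rate, resubstitution = resub_error))
```

The OOB figure should agree with the last row of err.rate, while the resubstitution figure is typically much lower, which is the "suspiciously good" result described in the original question.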

If the error rates were not based on OOB samples, they would drop to (near) 0 rather quickly, as each tree is intentionally overfitting its training set.
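
One way to see this is to track both error rates as trees are added. The sketch below is only an illustration under assumptions: iris, the seed, and ntree = 100 are arbitrary, and the helper `majority` is a hypothetical name that breaks ties by table order, which is a simplification of how the package itself handles ties.

```r
# Assumption: randomForest is installed; data set and seed are illustrative.
library(randomForest)

set.seed(1)
x <- iris[, 1:4]
y <- iris$Species
ntree <- 100
rf <- randomForest(x, y, ntree = ntree)

# OOB error of the first i trees, as stored by randomForest.
oob_err <- rf$err.rate[, "OOB"]

# Per-tree predictions on the training data (n x ntree character matrix).
votes <- predict(rf, newdata = x, predict.all = TRUE)$individual

# Majority vote of the first i trees for every training case.
# Hypothetical helper: ties go to the first class in table order.
majority <- function(v) names(which.max(table(v)))
train_err <- sapply(seq_len(ntree), function(i) {
  pred <- apply(votes[, seq_len(i), drop = FALSE], 1, majority)
  mean(pred != as.character(y))
})

# Compare the two curves at a few forest sizes.
print(round(cbind(oob = oob_err, train = train_err)[c(1, 5, 25, ntree), ], 3))
```

Under these assumptions the `train` column should fall towards 0 after only a few trees, while the `oob` column settles at a higher, more honest value.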

Andy
 

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Weidong Gu
> Sent: Sunday, November 27, 2011 10:56 AM
> To: Matthew Francis
> Cc: r-help at r-project.org
> Subject: Re: [R] Question about randomForest
> 
> Matthew,
> 
> Your interpretation that the error rates are calculated on the training
> data is incorrect.
> 
> Andy Liaw's help file says: "err.rate: (classification only) vector
> error rates of the prediction on the input data, the i-th element
> being the (OOB) error rate for all trees up to the i-th."
> 
> My understanding is that the error rate is calculated by running each
> case through only those trees, up to the i-th, for which it is OOB
> (after a few trees, every case in the original data will be OOB for
> some trees) and taking the majority vote. The plot of an rf object
> shows that the OOB error declines quickly once the ensemble becomes
> sizable, so increasing the variation among trees works! (If the error
> rates were based on the training sets, you wouldn't see such a drop,
> since each tree overfits its training set.)
> 
> Weidong
> 
> 
> On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis
> <mattjamesfrancis at gmail.com> wrote:
> > Thanks for the help. Let me explain in more detail how I think that
> > randomForest works so that you (or others) can more easily see the
> > error of my ways.
> >
> > The function first takes a random sample of the data, of the size
> > specified by the sampsize argument. With this it fully grows a tree
> > resulting in a horribly over-fitted classifier for the random sub-set.
> > It then repeats this again with a different sample to generate the
> > next tree and so on.
> >
> > Now, my understanding is that after each tree is constructed, a test
> > prediction for the *whole* training data set is made by combining the
> > results of all trees (so e.g. for classification the majority votes of
> > all individual tree predictions). From this an error rate is
> > determined (applicable to the ensemble applied to the training data)
> > and reported in the err.rate member of the returned randomForest
> > object. If you look at the error rate (or plot it using the default
> > plot method) you see that it starts out very high when only 1 or a few
> > over-fitted trees are contributing, but once the forest gets larger
> > the error rate drops since the ensemble is doing its job. It doesn't
> > make sense to me that this error rate is for a sub-set of the data,
> > since the sub-set in question changes at each step (i.e. at each tree
> > construction)?
> >
> > By doing a cross-validation test, making 'training' and 'test' sets from
> > the data I have, I do find that I get error rates on the test sets
> > comparable to the error rate that is obtained from the predicted
> > member of the returned randomForest object. So that does seem to be
> > the 'correct' error.
> >
> > By my understanding the error reported for the ith tree is that
> > obtained using all trees up to and including the ith tree to make an
> > ensemble prediction. Therefore the final error reported should be the
> > same as that obtained using the predict.randomForest function on the
> > training set, because by my understanding that should return an
> > identical result to that used to generate the error rate for the final
> > tree constructed??
> >
> > Sorry that is a bit long-winded, but I hope someone can point out
> > where I'm going wrong and set me straight.
> >
> > Thanks!
> >
> > On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu <anopheles123 at gmail.com> wrote:
> >> Hi Matthew,
> >>
> >> The error rate reported by randomForest is the prediction error based
> >> on out-of-bag (OOB) data. Therefore, it is different from the prediction
> >> error on the original data, since each tree was built using a bootstrap
> >> sample (about 63% of the distinct cases in the original data), and the
> >> OOB error rate is likely higher than the prediction error on the
> >> original data, as you observed.
> >>
> >> Weidong
> >>
> >> On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
> >> <mattjamesfrancis at gmail.com> wrote:
> >>> I've been using the R package randomForest but there is an aspect I
> >>> cannot work out the meaning of. After calling the randomForest
> >>> function, the returned object contains an element called predicted,
> >>> which is the prediction obtained using all the trees (at least that's
> >>> my understanding). I've checked that this prediction set has the error
> >>> rate as reported by err.rate.
> >>>
> >>> However, if I send the training data back into the
> >>> predict.randomForest function I find I get a different result to the
> >>> stored set of predictions. This is true for both classification and
> >>> regression. I find the predictions obtained this way also have a much
> >>> lower error rate and perform very well (suspiciously well...) on
> >>> measures such as AUC.
> >>>
> >>> My understanding is that the two predictions above should be the same.
> >>> Since they are not, I must be not understanding something properly.
> >>> Any ideas what's going on?
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>
> >