[R] Rpart and bagging - how is it done?

Liaw, Andy andy_liaw at merck.com
Mon Mar 10 16:24:23 CET 2008


I suppose better late than never:  It's possible to get bagging in
randomForest by simply setting mtry equal to the number of predictor
variables.  Note that this is one thing that I changed from Breiman &
Cutler's Fortran code:  They were sampling variables with replacement,
so if you use that code and set mtry to the number of predictor
variables, you're still not bagging.
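
For example, a minimal sketch (using the kyphosis data from the rpart
package purely for illustration):

library(randomForest)
library(rpart)   # for the kyphosis data
# With mtry equal to the number of predictors, every split considers
# all variables, so the random subsetting reduces to plain bagging.
bag <- randomForest(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    mtry = 3)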

One of my colleagues, Ting Wang, had provided a Matlab interface to
Breiman & Cutler's Fortran code (V3.1 I believe) that he made available
on StatLib.  It would be a fairly simple change to the underlying
Fortran code to get it to do bagging.

Andy

From: apjaworski at mmm.com
 
> I would like to thank Brian Ripley and Torsten Hothorn for their quick
> and thoughtful responses.
> 
> I reran the example given by Professor Ripley by just starting R and
> sourcing the code below, and I got slightly different results.  Then I
> ran it again, setting the random seed before the sample command, and I
> got identical results a few times.  However, I found the example below,
> which seems to be reproducible on my system (Win2000 Pro, Core Duo Xeon
> about a year old).  I get the same results in 2.6.2 (patched March 4)
> and 2.7.0 (version of February 28).  Both were compiled from the
> tarballs in Cygwin with up-to-date Rtools and no errors.  I just ran
> "make fullcheck" on 2.6.2 and it passes with no problems (just the
> usual stuff: network connectivity fails due to our firewall, plus
> slight numerical differences in a few cases).  The results from the
> rpart test are included at the bottom of this post.
> 
> set.seed(123)
> library(rpart)
> ind <- sample(1:81, replace=TRUE)
> rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
> rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
>        weights=tabulate(ind, nbins=81), xval=0)
> 
> Here is what I get:
> 
> > rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
> n= 81
> 
> node), split, n, loss, yval, (yprob)
>       * denotes terminal node
> 
> 1) root 81 14 absent (0.8271605 0.1728395) *
> > rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
> +        weights=tabulate(ind, nbins=81), xval=0)
> n= 81
> 
> node), split, n, loss, yval, (yprob)
>       * denotes terminal node
> 
>  1) root 81 14 absent (0.8271605 0.1728395)
>    2) Start>=8.5 62  6 absent (0.9062500 0.0937500)
>      4) Start>=14.5 29  0 absent (1.0000000 0.0000000) *
>      5) Start< 14.5 33  6 absent (0.8000000 0.2000000)
>       10) Age< 55 12  0 absent (1.0000000 0.0000000) *
>       11) Age>=55 21  6 absent (0.6000000 0.4000000)
>         22) Age>=111 14  2 absent (0.8000000 0.2000000) *
>         23) Age< 111 7  1 present (0.2000000 0.8000000) *
>    3) Start< 8.5 19  8 absent (0.5294118 0.4705882) *
> 
> The trees are dramatically different (the first one is just a root).
> The predictions are of course different (the first model predicts all
> cases as absent), but the total number of misclassified observations
> differs by only 1 (17 vs. 16).
> 
> Can anyone reproduce this, or is something wrong with my system?
> 
> Thanks again,
> 
> Andy
> 
> PS.  rpart version is 3.1-39
> 
> rpart results from "make fullcheck"
> 
> -------- Testing package rpart --------
> Massaging examples into 'rpart-Ex.R' ...
> Running examples in 'rpart-Ex.R' ...
> Running specific tests
>   Running `surv_test.R'
>   Running `testall.R'
>   Comparing `testall.Rout' to `testall.Rout.save' ...127c127
> <       g2      < 22.77 to the right, improve=6.8130, (6 missing)
> ---
> >       g2      < 22.76 to the right, improve=6.8130, (6 missing)
> 159c159
> <       g2      < 22.77 to the right, improve=4.8340, (6 missing)
> ---
> >       g2      < 22.76 to the right, improve=4.8340, (6 missing)
> 193c193
> <       grade < 3.5   to the left,  agree=0.772, adj=0.188, (0 split)
> ---
> >       grade < 3.5   to the left,  agree=0.772, adj=0.187, (0 split)
> 199c199
> <       g2      < 13.47 to the left,  improve=3.55300, (0 missing)
> ---
> >       g2      < 13.48 to the left,  improve=3.55300, (0 missing)
> 241c241
> <  1) root 146 53.420  5.893e-18
> ---
> >  1) root 146 53.420 -4.563e-17
> 275c275
> <   mean=5.893e-18, MSE=0.3659
> ---
> >   mean=-4.563e-17, MSE=0.3659
> 346c346
> <       g2      < 13.47 to the left,  improve=4.238e-02, (3 missing)
> ---
> >       g2      < 13.48 to the left,  improve=4.238e-02, (3 missing)
> 375c375
> <       g2      < 17.91 to the right, improve=0.1271000, (1 missing)
> ---
> >       g2      < 17.92 to the right, improve=0.1271000, (1 missing)
> 515c515
> <       g2      < 13.47 to the left,  improve=1.94600, (3 missing)
> ---
> >       g2      < 13.48 to the left,  improve=1.94600, (3 missing)
> 555c555
> <       g2      < 17.91 to the right, improve=3.122000, (1 missing)
> ---
> >       g2      < 17.92 to the right, improve=3.122000, (1 missing)
> 647c647
> <       life       < 70.25 to the right, improve=0.25230, (0 missing)
> ---
> >       life       < 70.26 to the right, improve=0.25230, (0 missing)
> OK
>   Running `usersplits.R'
>   Comparing `usersplits.Rout' to `usersplits.Rout.save' ...174c174
> < Timing ratio =  3.2
> ---
> > Timing ratio =  5.9
> OK
> 
> __________________________________
> Andy Jaworski
> 518-1-01
> Process Laboratory
> 3M Corporate Research Laboratory
> -----
> E-mail: apjaworski at mmm.com
> Tel:  (651) 733-6092
> Fax:  (651) 736-3122
> 
> 
> From: Prof Brian Ripley <ripley at stats.ox.ac.uk>
> To: apjaworski at mmm.com
> Cc: Torsten.Hothorn at R-project.org, R-help at R-project.org
> Date: 03/07/2008 03:11 AM
> Subject: Re: [R] Rpart and bagging - how is it done?
> 
> I believe that the procedure you describe at the end (resampling the
> cases) is the original interpretation of bagging, and that using
> weighting is equivalent when a procedure uses case weights.
> 
> If you are getting different results when replicating cases and when
> using weights, then rpart is not using its weights strictly as case
> weights, and it would be preferable to replicate cases.  But I am
> getting identical predictions by the two routes:
> 
> ind <- sample(1:81, replace=TRUE)
> rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
> rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
>        weights=tabulate(ind, nbins=81), xval=0)
> 
> My memory is that rpart uses unweighted numbers for its control params
> (unlike tree) and hence is not strictly using case weights.  I believe
> you can avoid that by setting the control params to their minimum and
> relying on pruning.
> 
> BTW, it is inaccurate to call these trees 'non-pruned' -- the default
> setting of cp is still (potentially) doing quite a lot of pruning.
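> 
> For illustration, a minimal sketch of that approach (minimal control
> settings, then prune; cp = 0.01 here is just the rpart default):
> 
> fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
>              weights = tabulate(ind, nbins = 81),
>              control = rpart.control(minsplit = 2, minbucket = 1,
>                                      cp = 0, xval = 0))
> pruned <- prune(fit, cp = 0.01)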
> 
> Torsten Hothorn can explain why he chose to do what he did.  There's a
> small (but only small) computational advantage in using case weights,
> but the tricky issue for me is how precisely tree growth is stopped,
> and I don't think that rpart at its default settings is mimicking what
> Breiman was doing (he would have been growing much larger trees).
> 
> 
> On Thu, 6 Mar 2008, apjaworski at mmm.com wrote:
> 
> >
> > Hi there.
> >
> > I was wondering if somebody knows how to perform a bagging procedure
> > on a classification tree without running the classifier with weights.
> >
> > Let me first explain why I need this and then give some details of
> > what I have found out so far.
> >
> > I am thinking about implementing the bagging procedure in Matlab.
> > Matlab has a simple classification tree function (in their Statistics
> > toolbox), but it does not accept weights.  A modification of the
> > Matlab procedure to accommodate weights would be very complicated.
> >
> > The rpart function in R accepts weights.  This seems to allow for a
> > rather simple implementation of bagging.  In fact, Everitt and
> > Hothorn, in chapter 8 of "A Handbook of Statistical Analyses Using
> > R", describe such a procedure.  The procedure consists of generating
> > several samples with replacement from the original data set.  This
> > data set has N rows.  The implementation described in the book first
> > fits a non-pruned tree to the original data set.  Then it generates
> > several (say, 25) multinomial samples of size N with probabilities
> > 1/N.  Then, each sample is used in turn as the weight vector to
> > update the original tree fit.  Finally, all the updated trees are
> > combined to produce "consensus" class predictions.
> >
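> > For concreteness, a minimal sketch of that procedure in R (my
> > reading of the book's recipe, not its exact code):
> >
> > library(rpart)
> > N <- nrow(kyphosis)
> > fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, xval = 0)
> > # 25 multinomial weight vectors, each summing to N, probabilities 1/N
> > W <- rmultinom(25, size = N, prob = rep(1/N, N))
> > # Refit the tree with each weight vector in turn
> > trees <- lapply(seq_len(ncol(W)), function(i) update(fit, weights = W[, i]))
> > # "Consensus" class predictions by majority vote across the trees
> > votes <- sapply(trees, function(tr)
> >     as.character(predict(tr, newdata = kyphosis, type = "class")))
> > consensus <- apply(votes, 1, function(v) names(which.max(table(v))))
> >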
> > Now, a typical realization of a multinomial sample consists of small
> > integers and several 0's.  I thought that the way the weighting
> > worked was this: the observations with weights equal to 0 are omitted
> > and the observations with weights > 1 are essentially replicated
> > according to the weight.  So I thought that instead of running the
> > rpart procedure with weights starting with, say, (1, 0, 2, 0, 1, ...
> > etc.), I could simply generate a sample data set by retaining row 1,
> > omitting row 2, replicating row 3 twice, omitting row 4, retaining
> > row 5, etc. (see the sketch below).  However, this does not seem to
> > work as I expected.  Instead of getting identical trees (from running
> > weighted rpart on the original data set and running rpart on the
> > sample data set described above with no weighting), I get trees that
> > are completely different (different threshold values and different
> > order of variables entering the splits).  Moreover, the predictions
> > from these trees can be different, so the misclassification rates
> > usually differ.
> >
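> > A minimal sketch of that row-replication idea, assuming an integer
> > weight vector w of length N (illustrative, not code from the book):
> >
> > w <- rmultinom(1, size = N, prob = rep(1/N, N))[, 1]
> > # Drop rows with weight 0 and repeat rows with weight > 1
> > expanded <- kyphosis[rep(seq_len(N), times = w), ]
> > rpart(Kyphosis ~ Age + Number + Start, data = expanded, xval = 0)
> >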
> > This finally brings me to my question: is there a way to mimic the
> > workings of the weighting in rpart by, for example, modifying the
> > data set or, perhaps, by some other means?
> >
> > Thanks in advance for your time,
> >
> > Andy
> >
> > __________________________________
> > Andy Jaworski
> > 518-1-01
> > Process Laboratory
> > 3M Corporate Research Laboratory
> > -----
> > E-mail: apjaworski at mmm.com
> > Tel:  (651) 733-6092
> > Fax:  (651) 736-3122
> >
> 
> --
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 

