[R] Gradient Boosting Trees with correlated predictors in gbm

Patrick Connolly p_connolly at slingshot.co.nz
Sun Mar 7 08:30:58 CET 2010


On Tue, 02-Mar-2010 at 02:43PM -0500, Liaw, Andy wrote:

|> In most implementations of boosting, and for that matter, single tree,
|> the first variable wins when there are ties.  In randomForest the

That still doesn't explain why, with gbm, two identical variables will
"share the glory" (approximately evenly), yet if there's a third, one of
them is ignored completely.  Clearly the sharing won't be exactly even,
but why the third should be 0 to a fairly large number of digits seems
strange.  If a second copy were ignored as well, it would be less
mysterious.

It doesn't affect the validity of the method, since three identical
columns would never arise in practice, but it would be good to
understand why it happens.
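
For what it's worth, one way to see where the ties actually go is to
look at the split variable chosen at each node of each tree.  The
sketch below is not from the original exchange: it rebuilds Max's
simulated data (without the column shuffle), adds the extra DUP column,
and then tabulates the split variables via pretty.gbm.tree().  It
doesn't prove anything about the C code, but it shows whether the third
copy ever gets chosen at all.

library(gbm)

set.seed(1)
n <- 100
p <- 10
data <- as.data.frame(matrix(rnorm(n * (p - 1)), nrow = n))
data$dup <- data[, p - 1]          # dup is an exact copy of V9
data$y   <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
data$DUP <- data$dup               # a third identical column

fit <- gbm(y ~ ., data = data,
           distribution = "gaussian",
           interaction.depth = 10,
           n.trees = 100,
           verbose = FALSE)

## SplitVar is a 0-based index into fit$var.names (-1 = terminal node),
## so this counts how often each predictor is actually split on.
splits <- unlist(lapply(seq_len(fit$n.trees),
                        function(i) pretty.gbm.tree(fit, i.tree = i)$SplitVar))
table(fit$var.names[splits[splits >= 0] + 1])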





|> variables are sampled, and thus not tested in the same order from one
|> node to the next, thus the variables are more likely to "share the
|> glory".
|> 
|> Best,
|> Andy 
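
As a side check on that sampling argument (again only a sketch, reusing
the data frame built in the snippet above, with randomForest's default
mtry), the permutation importance from randomForest should come out
broadly comparable across V9, dup and DUP if the copies really do share
the glory:

library(randomForest)

set.seed(2)                        # arbitrary seed; any value will do
rf <- randomForest(y ~ ., data = data, importance = TRUE)
## type = 1 is the permutation (%IncMSE) importance
round(importance(rf, type = 1), 2)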
|> 
|> From: Patrick Connolly
|> > 
|> > On Mon, 01-Mar-2010 at 12:01PM -0500, Max Kuhn wrote:
|> > 
|> > |> In theory, the choice between two perfectly correlated 
|> > predictors is
|> > |> random. Therefore, the importance should be "diluted" by half.
|> > |> However, this is implementation dependent.
|> > |> 
|> > |> For example, run this:
|> > |> 
|> > |>   set.seed(1)
|> > |>   n <- 100
|> > |>   p <- 10
|> > |> 
|> > |>   data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
|> > |>   data$dup <- data[, p-1]
|> > |> 
|> > |>   data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
|> > |> 
|> > |>   data <- data[, sample(1:ncol(data))]
|> > |> 
|> > |>   str(data)
|> > |> 
|> > |>   library(gbm)
|> > |>   fit <- gbm(y~., data = data,
|> > |>              distribution = "gaussian",
|> > |>              interaction.depth = 10,
|> > |>              n.trees = 100,
|> > |>              verbose = FALSE)
|> > |>   summary(fit)
|> > 
|> > What happens if there's a third?
|> > 
|> > 
|> > > data$DUP <- data$dup
|> > >  fit <- gbm(y~., data = data,
|> > +              distribution = "gaussian",
|> > +              interaction.depth = 10,
|> > +              n.trees = 100,
|> > +              verbose = FALSE)
|> > >   summary(fit)
|> >    var     rel.inf
|> > 1  DUP 55.98653321
|> > 2  dup 42.99934344
|> > 3   V2  0.30763599
|> > 4   V1  0.17108839
|> > 5   V4  0.14272470
|> > 6   V3  0.13069450
|> > 7   V6  0.07839121
|> > 8   V7  0.07109805
|> > 9   V5  0.06080096
|> > 10  V8  0.05168955
|> > 11  V9  0.00000000
|> > > 
|> > 
|> > So V9, which was identical to dup, has now gone off the radar
|> > altogether.
|> > 
|> > At first I thought that might be because 100 trees wasn't nearly
|> > enough, so I increased it to 6000 and added in some cross-validation.
|> > Doing a summary at the optimal number of trees still gives a similar
|> > result.
|> > 
|> > I have to admit to being somewhat puzzled.
|> > 
|> > 
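
The cross-validated fit mentioned above (6000 trees, then summarising
at the optimal iteration) can be reproduced along these lines,
continuing with the data frame from the first sketch; cv.folds = 5 and
the default shrinkage are my guesses rather than what was actually
used:

fit.cv <- gbm(y ~ ., data = data,
              distribution = "gaussian",
              interaction.depth = 10,
              n.trees = 6000,
              cv.folds = 5,
              verbose = FALSE)

## Pick the iteration that minimises the cross-validated error,
## then look at relative influence at that point.
best.iter <- gbm.perf(fit.cv, method = "cv", plot.it = FALSE)
summary(fit.cv, n.trees = best.iter)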

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___    Patrick Connolly   
 {~._.~}                   Great minds discuss ideas    
 _( Y )_  	         Average minds discuss events 
(:_~*~_:)                  Small minds discuss people  
 (_)-(_)  	                      ..... Eleanor Roosevelt
	  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.


