[R] Gradient Boosting Trees with correlated predictors in gbm

Liaw, Andy andy_liaw at merck.com
Tue Mar 2 20:43:04 CET 2010


In most implementations of boosting (and, for that matter, of single
trees), the first variable wins when there are ties.  In randomForest
the candidate variables are sampled at each node, so the tied
variables are not tested in the same order from one node to the next
and are more likely to "share the glory".

Best,
Andy 
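
P.S. For contrast, pushing the same kind of data through randomForest
(a minimal sketch, assuming the randomForest package is installed)
should spread the importance across both copies rather than piling it
onto one:

  library(randomForest)

  set.seed(1)
  n <- 100
  dat <- as.data.frame(matrix(rnorm(n * 9), nrow = n))
  dat$dup <- dat$V9                    # exact copy of V9
  dat$y <- 2 + 4 * dat$dup - 2 * dat$dup^2 + rnorm(n)

  rf <- randomForest(y ~ ., data = dat, importance = TRUE)
  round(importance(rf, type = 1), 2)   # permutation importance (%IncMSE)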

From: Patrick Connolly
> 
> On Mon, 01-Mar-2010 at 12:01PM -0500, Max Kuhn wrote:
> 
> |> In theory, the choice between two perfectly correlated
> |> predictors is random. Therefore, the importance should be
> |> "diluted" by half. However, this is implementation dependent.
> |> 
> |> For example, run this:
> |> 
> |>   set.seed(1)
> |>   n <- 100
> |>   p <- 10
> |> 
> |>   data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
> |>   data$dup <- data[, p-1]
> |> 
> |>   data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
> |> 
> |>   data <- data[, sample(1:ncol(data))]
> |> 
> |>   str(data)
> |> 
> |>   library(gbm)
> |>   fit <- gbm(y~., data = data,
> |>              distribution = "gaussian",
> |>              interaction.depth = 10,
> |>              n.trees = 100,
> |>              verbose = FALSE)
> |>   summary(fit)
> 
> What happens if there's a third?
> 
> 
> > data$DUP <- data$dup
> >  fit <- gbm(y~., data = data,
> +              distribution = "gaussian",
> +              interaction.depth = 10,
> +              n.trees = 100,
> +              verbose = FALSE)
> >   summary(fit)
>    var     rel.inf
> 1  DUP 55.98653321
> 2  dup 42.99934344
> 3   V2  0.30763599
> 4   V1  0.17108839
> 5   V4  0.14272470
> 6   V3  0.13069450
> 7   V6  0.07839121
> 8   V7  0.07109805
> 9   V5  0.06080096
> 10  V8  0.05168955
> 11  V9  0.00000000
> > 
> 
> So V9, which was identical to dup, has now gone off the radar
> altogether.
> 
> At first I thought that might be because 100 trees wasn't nearly
> enough, so I increased it to 6000 and added in some cross-validation.
> Doing a summary at the optimal number of trees still gives a similar
> result.
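> 
> [A sketch of that check, assuming 5-fold CV; the exact call was not
> shown in the thread:
> 
>   fit <- gbm(y ~ ., data = data, distribution = "gaussian",
>              interaction.depth = 10, n.trees = 6000, cv.folds = 5)
>   best <- gbm.perf(fit, method = "cv")   # CV-optimal number of trees
>   summary(fit, n.trees = best)
> ]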
> 
> I have to admit to being somewhat puzzled.