[R] Gradient Boosting Trees with correlated predictors in gbm

Patrick Connolly p_connolly at slingshot.co.nz
Tue Mar 2 20:32:41 CET 2010


On Mon, 01-Mar-2010 at 12:01PM -0500, Max Kuhn wrote:

|> In theory, the choice between two perfectly correlated predictors is
|> random. Therefore, the importance should be "diluted" by half.
|> However, this is implementation dependent.
|> 
|> For example, run this:
|> 
|>   set.seed(1)
|>   n <- 100
|>   p <- 10
|> 
|>   data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
|>   data$dup <- data[, p-1]
|> 
|>   data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
|> 
|>   data <- data[, sample(1:ncol(data))]
|> 
|>   str(data)
|> 
|>   library(gbm)
|>   fit <- gbm(y~., data = data,
|>              distribution = "gaussian",
|>              interaction.depth = 10,
|>              n.trees = 100,
|>              verbose = FALSE)
|>   summary(fit)

What happens if there's a third?


> data$DUP <- data$dup
>  fit <- gbm(y~., data = data,
+              distribution = "gaussian",
+              interaction.depth = 10,
+              n.trees = 100,
+              verbose = FALSE)
>   summary(fit)
   var     rel.inf
1  DUP 55.98653321
2  dup 42.99934344
3   V2  0.30763599
4   V1  0.17108839
5   V4  0.14272470
6   V3  0.13069450
7   V6  0.07839121
8   V7  0.07109805
9   V5  0.06080096
10  V8  0.05168955
11  V9  0.00000000
> 

So V9, which was identical to dup, has now gone off the radar altogether.

At first I thought that might be because 100 trees wasn't nearly
enough, so I increased it to 6000 and added in some cross-validation.
Doing a summary at the optimal number of trees still gives a similar
result.
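For reference, the cross-validated run described above could look like the
following (cv.folds = 5 is an assumption; the post doesn't say how many
folds were used):

```r
library(gbm)

## Refit with many more trees and built-in cross-validation
fit <- gbm(y ~ ., data = data,
           distribution = "gaussian",
           interaction.depth = 10,
           n.trees = 6000,
           cv.folds = 5,
           verbose = FALSE)

## Estimate the optimal number of trees from the CV error curve,
## then report relative influence at that iteration
best.iter <- gbm.perf(fit, method = "cv")
summary(fit, n.trees = best.iter)
```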

I have to admit to being somewhat puzzled.
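One way to probe whether the lopsided split between the duplicates is an
artifact of column order (e.g. deterministic tie-breaking when two splits
are equally good) rather than a genuine ranking is to reshuffle the
columns and refit. This is a diagnostic sketch, not a statement about
gbm's internals:

```r
## Refit on the same data with the columns in a different order.
## If relative influence migrates between dup, DUP, and V9, the
## split is a tie-breaking artifact, not a real difference.
set.seed(2)
data2 <- data[, sample(ncol(data))]
fit2 <- gbm(y ~ ., data = data2,
            distribution = "gaussian",
            interaction.depth = 10,
            n.trees = 100,
            verbose = FALSE)
summary(fit2)
```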


-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___    Patrick Connolly   
 {~._.~}                   Great minds discuss ideas    
 _( Y )_  	         Average minds discuss events 
(:_~*~_:)                  Small minds discuss people  
 (_)-(_)  	                      ..... Eleanor Roosevelt
	  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.


