[R] Gradient Boosting Trees with correlated predictors in gbm
Patrick Connolly
p_connolly at slingshot.co.nz
Tue Mar 2 20:32:41 CET 2010
On Mon, 01-Mar-2010 at 12:01PM -0500, Max Kuhn wrote:
|> In theory, the choice between two perfectly correlated predictors is
|> random. Therefore, the importance should be "diluted" by half.
|> However, this is implementation dependent.
|>
|> For example, run this:
|>
|> set.seed(1)
|> n <- 100
|> p <- 10
|>
|> data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
|> data$dup <- data[, p-1]
|>
|> data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
|>
|> data <- data[, sample(1:ncol(data))]
|>
|> str(data)
|>
|> library(gbm)
|> fit <- gbm(y~., data = data,
|>            distribution = "gaussian",
|>            interaction.depth = 10,
|>            n.trees = 100,
|>            verbose = FALSE)
|> summary(fit)
What happens if there's a third?
> data$DUP <- data$dup
> fit <- gbm(y~., data = data,
+            distribution = "gaussian",
+            interaction.depth = 10,
+            n.trees = 100,
+            verbose = FALSE)
> summary(fit)
    var     rel.inf
1   DUP 55.98653321
2   dup 42.99934344
3    V2  0.30763599
4    V1  0.17108839
5    V4  0.14272470
6    V3  0.13069450
7    V6  0.07839121
8    V7  0.07109805
9    V5  0.06080096
10   V8  0.05168955
11   V9  0.00000000
>
So V9, which was identical to dup, has now dropped off the radar altogether.
At first I thought that might be because 100 trees wasn't nearly enough,
so I increased it to 6000 and added some cross-validation. Doing a summary
at the optimal number of trees still gives a similar result.
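
For what it's worth, that attempt looked roughly like this (the 5 folds
and the seed were arbitrary choices on my part):

set.seed(1)
fit.cv <- gbm(y ~ ., data = data,
              distribution = "gaussian",
              interaction.depth = 10,
              n.trees = 6000,
              cv.folds = 5,        ## folds chosen arbitrarily
              verbose = FALSE)

## iteration with the lowest cross-validated error
best.iter <- gbm.perf(fit.cv, method = "cv", plot.it = FALSE)

## relative influence evaluated at that iteration only
summary(fit.cv, n.trees = best.iter, plotit = FALSE)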
I have to admit to being somewhat puzzled.
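
If the choice between identical copies really is order- or
implementation-dependent, as Max suggests, one crude check would be to
refit after shuffling the column order and see whether a different copy
soaks up the influence (a sketch only; fit2 and data2 are just names I
made up here):

set.seed(2)
data2 <- data[, sample(ncol(data))]   ## same data, columns in a new order
fit2 <- gbm(y ~ ., data = data2,
            distribution = "gaussian",
            interaction.depth = 10,
            n.trees = 100,
            verbose = FALSE)
ri <- summary(fit2, plotit = FALSE)
ri[ri$var %in% c("dup", "DUP", "V9"), ]   ## compare the three identical copies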
--
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.
___ Patrick Connolly
{~._.~} Great minds discuss ideas
_( Y )_ Average minds discuss events
(:_~*~_:) Small minds discuss people
(_)-(_) ..... Eleanor Roosevelt
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.