Patrick Connolly
p_connolly at slingshot.co.nz
Sun Mar 7 08:30:58 CET 2010
On Tue, 02-Mar-2010 at 02:43PM -0500, Liaw, Andy wrote:
|> In most implementations of boosting, and for that matter, single tree,
|> the first variable wins when there are ties. In randomForest the
That still doesn't explain why, with gbm, two identical variables will
"share the glory" (approximately evenly), but if there's a third, one
will be ignored completely. I wouldn't expect the two to come out exactly
even, but it seems strange that a third should get 0 to a fairly large
number of digits. If it ignored the second of two, it would be less
mysterious.
It doesn't affect the validity of the method, since three identical
columns would never arise in practice, but it would be good to understand
why such a thing happens.
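
To make the tie rule in the quoted text concrete, here is a toy split
search in base R. It is only a sketch of the "first variable wins" idea
and of what sampling the order does to it; it is not gbm's or
randomForest's actual code, and best_gain, pick_split and the toy data
are made-up names for illustration.

```r
# Best reduction in squared error over all thresholds of one predictor.
best_gain <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- cuts[-length(cuts)]  # a cut at the maximum would leave one side empty
  total <- sum((y - mean(y))^2)
  gains <- vapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    total - sum((left - mean(left))^2) - sum((right - mean(right))^2)
  }, numeric(1))
  max(gains)
}

# Scan candidates in the given order; the strict ">" means that among
# exactly tied candidates the first one scanned is kept.
pick_split <- function(data, y, vars) {
  best_var <- NA_character_
  best <- -Inf
  for (v in vars) {
    g <- best_gain(data[[v]], y)
    if (g > best) {
      best <- g
      best_var <- v
    }
  }
  best_var
}

set.seed(1)
x <- rnorm(50)
toy <- data.frame(a = x, b = x, c = x, noise = rnorm(50))
y <- 4 * x + rnorm(50)

# Fixed scan order: the first of the identical columns is chosen every time.
fixed <- replicate(200, pick_split(toy, y, names(toy)))
table(fixed)

# Order reshuffled on each call (loosely like sampling variables per node):
# the identical columns take turns winning.
shuffled <- replicate(200, pick_split(toy, y, sample(names(toy))))
table(shuffled)
```

Under this toy rule a fixed order would leave both the second and the
third copy at zero, which is why one copy being dropped while the other
two share is the puzzling part.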
|> variables are sampled, and thus not tested in the same order from one
|> node to the next, thus the variables are more likely to "share the
|> glory".
|>
|> Best,
|> Andy
|>
|> From: Patrick Connolly
|> >
|> > On Mon, 01-Mar-2010 at 12:01PM -0500, Max Kuhn wrote:
|> >
|> > |> In theory, the choice between two perfectly correlated predictors is
|> > |> random. Therefore, the importance should be "diluted" by half.
|> > |> However, this is implementation dependent.
|> > |>
|> > |> For example, run this:
|> > |>
|> > |> set.seed(1)
|> > |> n <- 100
|> > |> p <- 10
|> > |>
|> > |> data <- as.data.frame(matrix(rnorm(n*(p-1)), nrow = n))
|> > |> data$dup <- data[, p-1]
|> > |>
|> > |> data$y <- 2 + 4 * data$dup - 2 * data$dup^2 + rnorm(n)
|> > |>
|> > |> data <- data[, sample(1:ncol(data))]
|> > |>
|> > |> str(data)
|> > |>
|> > |> library(gbm)
|> > |> fit <- gbm(y ~ ., data = data,
|> > |>            distribution = "gaussian",
|> > |>            interaction.depth = 10,
|> > |>            n.trees = 100,
|> > |>            verbose = FALSE)
|> > |> summary(fit)
|> >
|> > What happens if there's a third?
|> >
|> >
|> > > data$DUP <- data$dup
|> > > fit <- gbm(y ~ ., data = data,
|> > +            distribution = "gaussian",
|> > +            interaction.depth = 10,
|> > +            n.trees = 100,
|> > +            verbose = FALSE)
|> > > summary(fit)
|> >    var     rel.inf
|> > 1  DUP 55.98653321
|> > 2  dup 42.99934344
|> > 3   V2  0.30763599
|> > 4   V1  0.17108839
|> > 5   V4  0.14272470
|> > 6   V3  0.13069450
|> > 7   V6  0.07839121
|> > 8   V7  0.07109805
|> > 9   V5  0.06080096
|> > 10  V8  0.05168955
|> > 11  V9  0.00000000
|> > >
|> >
|> > So V9 which was identical to dup has now gone off the radar
|> > altogether.
|> >
|> > At first I thought that might be because 100 trees wasn't nearly
|> > enough, so I increased it to 6000 and added in some cross-validation.
|> > Doing a summary at the optimal number of trees still gives a similar
|> > result.
|> >
|> > I have to admit to being somewhat puzzled.
