[R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?

Marc Schwartz marc_schwartz at me.com
Fri Oct 4 21:34:11 CEST 2013


On Oct 4, 2013, at 2:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:

> This reproducible example is from the help page of 'gbm' in R.
> 
> I ran the following code in R, and it works fine as long as the response is
> numeric.  The problem starts when I convert the response from numeric to
> binary (0/1): it gives me an error.
> 
> My question is: will converting the response from numeric to binary have
> this much effect?
> 
> Help page code:
> 
> library(gbm)                      # provides gbm()
> N <- 1000
> X1 <- runif(N)
> X2 <- 2*runif(N)
> X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
> X4 <- factor(sample(letters[1:6],N,replace=TRUE))
> X5 <- factor(sample(letters[1:3],N,replace=TRUE))
> X6 <- 3*runif(N)
> mu <- c(-1,0,1,2)[as.numeric(X3)]
> 
> SNR <- 10 # signal-to-noise ratio
> Y <- X1**1.5 + 2 * (X2**.5) + mu
> sigma <- sqrt(var(Y)/SNR)
> Y <- Y + rnorm(N,0,sigma)
> 
> # introduce some missing values
> X1[sample(1:N,size=500)] <- NA
> X4[sample(1:N,size=300)] <- NA
> 
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
> 
> # fit initial model
> gbm1 <-
>  gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
>      data=data,                   # dataset
>      var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
>      # +1: monotone increase,
>      #  0: no monotone restrictions
>      distribution="gaussian",     # see the help for other choices
>      n.trees=1000,                # number of trees
>      shrinkage=0.05,              # shrinkage or learning rate,
>      # 0.001 to 0.1 usually work
>      interaction.depth=3,         # 1: additive model, 2: two-way
>      # interactions, etc.
>      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably
>      # best
>      train.fraction = 0.5,        # fraction of data for training,
>      # first train.fraction*N used for training
>      n.minobsinnode = 10,         # minimum total weight needed in each
>      # node
>      cv.folds = 3,                # do 3-fold cross-validation
>      keep.data=TRUE,              # keep a copy of the dataset with the
>      # object
>      verbose=FALSE)               # don't print out progress
> 
> gbm1
> summary(gbm1)
> 
> 
> Now I slightly change the response variable to make it binary.
> 
> cutoff <- mean(Y)              #My edit: fix the cut point before overwriting Y
> Y <- as.numeric(Y >= cutoff)   #My edit: 0 below the mean, 1 otherwise
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
> fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
> 
> gbm2 <-
>  gbm(fmla,                        # formula
>      data=data,                   # dataset
>      distribution="bernoulli",     # My edit
>      n.trees=1000,                # number of trees
>      shrinkage=0.05,              # shrinkage or learning rate,
>      # 0.001 to 0.1 usually work
>      interaction.depth=3,         # 1: additive model, 2: two-way
>      # interactions, etc.
>      bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably
>      # best
>      train.fraction = 0.5,        # fraction of data for training,
>      # first train.fraction*N used for training
>      n.minobsinnode = 10,         # minimum total weight needed in each
>      # node
>      cv.folds = 3,                # do 3-fold cross-validation
>      keep.data=TRUE,              # keep a copy of the dataset with the
>      # object
>      verbose=FALSE)               # don't print out progress
> 
> gbm2
> 
> 
>> gbm2
> gbm(formula = fmla, distribution = "bernoulli", data = data,
>    n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
>    shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
>    cv.folds = 3, keep.data = TRUE, verbose = FALSE)
> A gradient boosted model with bernoulli loss function.
> 1000 iterations were performed.
> The best cross-validation iteration was .
> The best test-set iteration was .
> Error in 1:n.trees : argument of length 0
> 
> 
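
A side note on the call above: the help page describes
distribution="bernoulli" as logistic regression for 0-1 outcomes, whereas
factor(Y) is stored internally as integer codes 1 and 2. Whether that
mismatch is what produces the empty "best iteration" output is an
assumption; it has not been checked in this thread. The sketch below uses
base R only, and the simulated Y and the names cutoff, Ybin and Yfac are
made up for illustration. It only shows how the two encodings differ:

```r
set.seed(1)                       # arbitrary seed, for reproducibility only
Y <- rnorm(1000, mean = 2)        # stand-in for the continuous response

# Compute the cut point once, before Y is replaced by anything else.
cutoff <- mean(Y)
Ybin <- as.numeric(Y >= cutoff)   # plain numeric vector of 0s and 1s

# The same values wrapped in a factor carry integer codes 1 and 2.
Yfac <- factor(Ybin)

sort(unique(Ybin))                # 0 1
sort(unique(as.numeric(Yfac)))    # 1 2
```

If the encoding is the cause, the thing to try would be to keep Y as the
numeric 0/1 column and write the formula as Y~X1+X2+X3+X4+X5+X6, without
factor().
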
> My question is: will binarizing the response have so much effect that it
> does not find anything useful in the predictors?
> 
> Thanks



Sure, it's possible. See this page for a good overview of why you should not dichotomize continuous data:

  http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous
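
As a rough illustration of the information lost, here is a small simulation
sketch. It is not taken from that page; the variables x, y and y_bin, the
noise level and the sample size are arbitrary choices. It compares how
strongly a predictor correlates with a continuous response and with the same
response after splitting it at its mean:

```r
set.seed(42)                        # arbitrary seed, for reproducibility only
n <- 10000
x <- rnorm(n)                       # one continuous predictor
y <- x + rnorm(n)                   # continuous response: signal plus noise

y_bin <- as.numeric(y >= mean(y))   # the same response, dichotomized at its mean

r_cont <- cor(x, y)                 # correlation with the continuous response
r_bin  <- cor(x, y_bin)             # correlation with the 0/1 response

c(continuous = r_cont, dichotomized = r_bin)
```

In this setup the dichotomized response should show a visibly weaker
correlation with x than the continuous one, because every observation on the
same side of the cut point is treated as identical.
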

Regards,

Marc Schwartz


