[R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?

Fri Oct 4 21:26:31 CEST 2013

"My question is, Is binarizing the response will have so much effect that it
does not find anything useful in the predictors?"

Yes. Dichotomizing throws away most of the information in the data,
which is why you shouldn't do it.
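
As a rough illustration of the point, here is a minimal sketch on simulated
data. It is not the gbm example from the original post: the variables x, y
and y_bin are made up for this illustration, and a plain correlation stands
in for "information". It compares how strongly a predictor tracks a
continuous response with how strongly it tracks the same response after a
split at the mean.

```r
set.seed(1)                          # arbitrary seed, for reproducibility only
n <- 10000
x <- rnorm(n)                        # hypothetical predictor
y <- x + rnorm(n)                    # hypothetical continuous response
y_bin <- as.numeric(y >= mean(y))    # dichotomize at the mean: 1 if at or above, else 0

r_cont <- cor(x, y)                  # correlation with the continuous response
r_bin  <- cor(x, y_bin)              # correlation with the dichotomized response

# the dichotomized response is less strongly related to x
print(c(continuous = r_cont, dichotomized = r_bin, ratio = r_bin / r_cont))
```

In this Gaussian setup the ratio should come out near sqrt(2/pi), about 0.8,
so the share of variance explained by x falls to roughly two thirds of what
it was. How much is lost in a real data set depends on the distributions and
on where the cut is made.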

This is a statistics question, not an R question, so any follow-up should be
posted on a statistical list like stats.stackexchange.com, not here.

-- Bert

On Fri, Oct 4, 2013 at 12:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:
> This reproducible example is from the help of 'gbm' in R.
>
> I ran the following code in R, and it works fine as long as the response is
> numeric.  The problem starts when I convert the response from numeric to
> binary (0/1). It gives me an error.
>
> My question is, does converting the response from numeric to binary have
> this much effect?
>
> Help page code:
>
> library(gbm)  # load the package that provides gbm()
> N <- 1000
> X1 <- runif(N)
> X2 <- 2*runif(N)
> X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
> X4 <- factor(sample(letters[1:6],N,replace=TRUE))
> X5 <- factor(sample(letters[1:3],N,replace=TRUE))
> X6 <- 3*runif(N)
> mu <- c(-1,0,1,2)[as.numeric(X3)]
>
> SNR <- 10 # signal-to-noise ratio
> Y <- X1**1.5 + 2 * (X2**.5) + mu
> sigma <- sqrt(var(Y)/SNR)
> Y <- Y + rnorm(N,0,sigma)
>
> # introduce some missing values
> X1[sample(1:N,size=500)] <- NA
> X4[sample(1:N,size=300)] <- NA
>
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
>
> # fit initial model
> gbm1 <-
>   gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
>       data=data,                   # dataset
>       var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
>       # +1: monotone increase,
>       #  0: no monotone restrictions
>       distribution="gaussian",     # see the help for other choices
>       n.trees=1000,                # number of trees
>       shrinkage=0.05,              # shrinkage or learning rate,
>       # 0.001 to 0.1 usually work
>       interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
>       bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
>       train.fraction = 0.5,        # fraction of data for training,
>       # first train.fraction*N used for training
>       n.minobsinnode = 10,         # minimum total weight needed in each node
>       cv.folds = 3,                # do 3-fold cross-validation
>       keep.data=TRUE,              # keep a copy of the dataset with the object
>       verbose=FALSE)               # don't print out progress
>
> gbm1
> summary(gbm1)
>
>
> Now I slightly change the response variable to make it binary.
>
> Y <- as.numeric(Y >= mean(Y))  #My edit: 1 if at or above the mean, else 0
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
> fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
>
> gbm2 <-
>   gbm(fmla,                        # formula
>       data=data,                   # dataset
>       distribution="bernoulli",     # My edit
>       n.trees=1000,                # number of trees
>       shrinkage=0.05,              # shrinkage or learning rate,
>       # 0.001 to 0.1 usually work
>       interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
>       bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
>       train.fraction = 0.5,        # fraction of data for training,
>       # first train.fraction*N used for training
>       n.minobsinnode = 10,         # minimum total weight needed in each node
>       cv.folds = 3,                # do 3-fold cross-validation
>       keep.data=TRUE,              # keep a copy of the dataset with the object
>       verbose=FALSE)               # don't print out progress
>
> gbm2
>
>
>> gbm2
> gbm(formula = fmla, distribution = "bernoulli", data = data,
>     n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
>     shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
>     cv.folds = 3, keep.data = TRUE, verbose = FALSE)
> A gradient boosted model with bernoulli loss function.
> 1000 iterations were performed.
> The best cross-validation iteration was .
> The best test-set iteration was .
> Error in 1:n.trees : argument of length 0
>
>
> My question is, Is binarizing the response will have so much effect that it
> does not find anything useful in the predictors?
>
> Thanks
>
> --
> -------------
> Mary Kindall
> Yorktown Heights, NY
> USA
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 

Bert Gunter
Genentech Nonclinical Biostatistics

(650) 467-7374