[R] Why 'gbm' is not giving me error when I change the response from numeric to categorical?
Marc Schwartz
marc_schwartz at me.com
Fri Oct 4 21:34:11 CEST 2013
On Oct 4, 2013, at 2:16 PM, Mary Kindall <mary.kindall at gmail.com> wrote:
> This reproducible example is from the help of 'gbm' in R.
>
> I ran the following code in R, and it works fine as long as the response is
> numeric. The problem starts when I convert the response from numeric to
> binary (0/1): it gives me an error.
>
> My question is: can converting the response from numeric to binary have
> this much effect?
>
> Help page code:
>
> N <- 1000
> X1 <- runif(N)
> X2 <- 2*runif(N)
> X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
> X4 <- factor(sample(letters[1:6],N,replace=TRUE))
> X5 <- factor(sample(letters[1:3],N,replace=TRUE))
> X6 <- 3*runif(N)
> mu <- c(-1,0,1,2)[as.numeric(X3)]
>
> SNR <- 10 # signal-to-noise ratio
> Y <- X1**1.5 + 2 * (X2**.5) + mu
> sigma <- sqrt(var(Y)/SNR)
> Y <- Y + rnorm(N,0,sigma)
>
> # introduce some missing values
> X1[sample(1:N,size=500)] <- NA
> X4[sample(1:N,size=300)] <- NA
>
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
>
> # fit initial model
> gbm1 <-
>   gbm(Y~X1+X2+X3+X4+X5+X6,          # formula
>       data=data,                    # dataset
>       var.monotone=c(0,0,0,0,0,0),  # -1: monotone decrease,
>                                     # +1: monotone increase,
>                                     #  0: no monotone restrictions
>       distribution="gaussian",      # see the help for other choices
>       n.trees=1000,                 # number of trees
>       shrinkage=0.05,               # shrinkage or learning rate,
>                                     # 0.001 to 0.1 usually work
>       interaction.depth=3,          # 1: additive model, 2: two-way interactions, etc.
>       bag.fraction = 0.5,           # subsampling fraction, 0.5 is probably best
>       train.fraction = 0.5,         # fraction of data for training,
>                                     # first train.fraction*N used for training
>       n.minobsinnode = 10,          # minimum total weight needed in each node
>       cv.folds = 3,                 # do 3-fold cross-validation
>       keep.data=TRUE,               # keep a copy of the dataset with the object
>       verbose=FALSE)                # don't print out progress
>
> gbm1
> summary(gbm1)
>
>
> Now I slightly change the response variable to make it binary.
>
> Y[Y < mean(Y)] = 0 #My edit
> Y[Y >= mean(Y)] = 1 #My edit
> data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
> fmla = as.formula(factor(Y)~X1+X2+X3+X4+X5+X6) #My edit
>
> gbm2 <-
>   gbm(fmla,                         # formula
>       data=data,                    # dataset
>       distribution="bernoulli",     # My edit
>       n.trees=1000,                 # number of trees
>       shrinkage=0.05,               # shrinkage or learning rate,
>                                     # 0.001 to 0.1 usually work
>       interaction.depth=3,          # 1: additive model, 2: two-way interactions, etc.
>       bag.fraction = 0.5,           # subsampling fraction, 0.5 is probably best
>       train.fraction = 0.5,         # fraction of data for training,
>                                     # first train.fraction*N used for training
>       n.minobsinnode = 10,          # minimum total weight needed in each node
>       cv.folds = 3,                 # do 3-fold cross-validation
>       keep.data=TRUE,               # keep a copy of the dataset with the object
>       verbose=FALSE)                # don't print out progress
>
> gbm2
>
>
>> gbm2
> gbm(formula = fmla, distribution = "bernoulli", data = data,
> n.trees = 1000, interaction.depth = 3, n.minobsinnode = 10,
> shrinkage = 0.05, bag.fraction = 0.5, train.fraction = 0.5,
> cv.folds = 3, keep.data = TRUE, verbose = FALSE)
> A gradient boosted model with bernoulli loss function.
> 1000 iterations were performed.
> The best cross-validation iteration was .
> The best test-set iteration was .
> Error in 1:n.trees : argument of length 0
>
>
> My question is: will binarizing the response have so much effect that it
> does not find anything useful in the predictors?
>
> Thanks
Sure, it's possible. See this page for a good overview of why you should not dichotomize continuous data:
http://biostat.mc.vanderbilt.edu/wiki/Main/CatContinuous
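
As a rough illustration of the information loss, here is a sketch in base R on simulated data (not the data from the post; x, y, cutoff and the seed are made up for this example). It compares how strongly a predictor correlates with a continuous response and with the same response after a mean split. It also shows why the cut point is best computed once: recoding in two steps calls mean() again after some values have already been overwritten, which can leave values that are neither 0 nor 1. The 0/1 result is kept numeric rather than wrapped in factor(), on the assumption (taken from the gbm help text, which describes "bernoulli" as logistic regression for 0-1 outcomes) that a numeric 0/1 response is what that distribution expects.

```r
set.seed(42)             # arbitrary seed so the run is repeatable
n <- 5000
x <- rnorm(n)            # hypothetical predictor
y <- 2 * x + rnorm(n)    # hypothetical continuous response, centred near 0

# Fix the cut point once, then recode in a single comparison.
cutoff <- mean(y)
y_bin  <- as.numeric(y >= cutoff)   # numeric 0/1, not a factor

# Association with the predictor before and after dichotomizing.
r_cont <- cor(x, y)
r_bin  <- cor(x, y_bin)
cat("correlation with continuous y: ", r_cont, "\n")
cat("correlation with dichotomized y:", r_bin, "\n")

# Two-step recoding for comparison: the second line uses a mean that was
# computed after the first line had already replaced part of the vector.
y_twostep <- y
y_twostep[y_twostep <  mean(y_twostep)] <- 0
y_twostep[y_twostep >= mean(y_twostep)] <- 1

# Values that fell between the old and the new mean are left untouched,
# so this vector is not guaranteed to be binary.
cat("distinct values, one-step recode:", length(unique(y_bin)), "\n")
cat("distinct values, two-step recode:", length(unique(y_twostep)), "\n")
```

Whether the two-step recode misbehaves depends on the data: when every value above the original mean also lies above the recomputed mean, both approaches give the same 0/1 vector, so checking with table() or unique() before fitting is a cheap safeguard.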
Regards,
Marc Schwartz