[R-sig-eco] help with categorical responses in boosted classification trees (gbm package)

Kingsford Jones kingsfordjones at gmail.com
Sat Oct 11 19:40:30 CEST 2008


On Fri, Oct 10, 2008 at 10:41 AM, Jill Johnstone
<jill.johnstone at usask.ca> wrote:
> Hello,
>
> I am working on developing code for a boosted classification tree that
> predicts membership within 4 non-ordered classes, using the gbm or gbmplus
> packages in R. I've been successful (I think) in using this package for
> regression trees, where the response is numeric. However,
> I'm running into problems setting up a boosted tree for a categorical
> response that is not simply a 0,1 response. In my case, the response is a
> non-ordered factor that represents different vegetation community types.

Are you sure that gbm is designed to handle multi-class responses
(i.e. >2 levels)?

> There are 4 factor levels and n=90 for the dataset.

90 observations may well not provide enough information to predict
four response levels (depending on the strength of the relationships,
the number of observations in each class, etc.).
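
For example, a quick check of the class balance (assuming the data frame
and response names from the code you posted below) is:

  table(natseed$veg)   # counts per class; rare classes will be hard to predict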

>
> I think the problem may be that I am not specifying a proper error
> distribution. GBM help specifies the following options for this:
>
> "..."gaussian" (squared error), "laplace" (absolute loss), "bernoulli"
> (logistic regression for 0-1 outcomes), "adaboost" (the AdaBoost exponential
> loss for 0-1 outcomes), "poisson" (count outcomes), and "coxph" (censored
> observations)."
>
> I believe that the Gaussian error distribution is most appropriate for these
> data, and this is what I've been using. Below is the code that I am running:


Given the categorical response, squared error loss would not be a good
choice.  If gbm is designed to handle multi-category responses, the
'bernoulli' or 'adaboost' options are more appropriate.
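
If it turns out that gbm only handles 0/1 outcomes, one workaround is to
fit a separate one-vs-rest bernoulli model for each class and take the
predicted class to be the one whose model returns the highest probability.
A rough, untested sketch (reusing the data frame 'natseed' and response
'veg' from your code below, and writing interaction.depth out in full
rather than the 'int' abbreviation):

  library(gbm)

  ## one bernoulli (logistic) boosted model per vegetation class,
  ## coding the response as 1 for that class and 0 otherwise
  classes <- levels(natseed$veg)
  fits <- lapply(classes, function(cl) {
    natseed$y01 <- as.numeric(natseed$veg == cl)
    gbm(y01 ~ lat + elev + moist.class + BA.stnd + pre.decid,
        data = natseed, distribution = "bernoulli",
        n.trees = 900, interaction.depth = 3, n.minobsinnode = 5,
        shrinkage = 0.003, bag.fraction = 0.5, cv.folds = 5)
  })
  names(fits) <- classes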


>
> tree1 <- gbm(veg ~ lat+elev+moist.class+BA.stnd+pre.decid,
>    data = natseed, n.tree=900, int=3, n.minobsinnode=5,
>    distribution="gaussian", shrinkage=0.003,
>    bag.fraction=0.5, cv.folds=5)
> all.summary(tree.1)
>
> And the error I am currently getting specifies a problem with the
> cross-validation, but I am not sure how to interpret this:
> "Error in if (x[[1]]$type != "cv") stop("Not a CV tree !!\n") : argument is
> of length zero"
>
> I'd really appreciate suggestions about where I might be going wrong, if
> anyone has any. I've been able to run this successfully as a regular
> classification tree using the "tree" library, but had hoped to apply the
> boosting approach. I've been referring to two excellent ecological papers
> that describe this technique, but neither deals with this type of
> classification tree:

Have a look at the randomForest package.  Also the following Google
search may help:

boosting "multi-class" OR "k-class" OR multicategorical OR multinomial

hth,

Kingsford Jones


> 1. De'ath, G. 2007. Boosted trees for ecological modeling and prediction.
> Ecology 88: 243-251.
> 2. Elith, J., Leathwick, J.R., and Hastie, T. 2008. A working guide to
> boosted regression trees. J. Animal Ecol. 77: 802-813.
>
> Thanks in advance for any suggestions.
>
> Jill Johnstone
> assistant professor
> Department of Biology
> University of Saskatchewan
> 112 Science Place
> Saskatoon SK S7N 5E2
> ph:(306)966-4421  fax:966-4461
> website: www.usask.ca/biology/johnstone/
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>


