[R] gbm

Liaw, Andy andy_liaw at merck.com
Thu Jan 13 01:53:52 CET 2005


> From: Weiwei Shi
> 
> Hi, there:
> Thanks a lot for everyone's prompt replies.
> 
> In detail, I am facing a huge amount of data: over
> 10,000 cases and 400 variables. This project is very
> challenging and interesting to me. I tried rpart, which
> gave me some promising results, but not good enough.
> So I am now trying randomForest and gbm.
> 
> My plan of using gbm is like this:
> rt<-rpart(...)
> gbm(formula(rt)...)
> 
> Does this work? (My first question)

Given a machine with sufficient memory and CPU speed, yes.
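A minimal sketch of the plan described above, assuming an illustrative data frame `dat` with a 0/1 response `y` (the object and variable names here are hypothetical, not from the original post):

```r
## Sketch: fit an rpart tree, then reuse its formula in gbm.
## Assumes the rpart and gbm packages are installed, and a data
## frame `dat` with a 0/1 response `y` -- names are illustrative.
library(rpart)
library(gbm)

rt <- rpart(y ~ ., data = dat, method = "class")

## formula(rt) recovers the formula used to fit the tree,
## so it can be passed straight to gbm().
fit <- gbm(formula(rt), data = dat,
           distribution = "bernoulli",
           n.trees = 500, interaction.depth = 3)
```

Note that for `distribution = "bernoulli"` the response must be coded 0/1, whereas rpart's `method = "class"` also accepts factors, so some recoding may be needed in between.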
 
> My other concern with gbm is scalability, since I
> realize R seems to load all the data into memory. (My
> second question)

We have dealt with data larger than what you described.  One thing to avoid
is the use of the formula interface if you have _lots_ (like, hundreds) of
variables.  gbm.fit(), I believe, was created for that reason.
 
> But I believe the idea above will run very slowly. (I
> think I might try TreeNet, though I don't like that it
> is commercial.) BTW, sampling might help, but previous
> experiments suggest it is not a good idea for my
> project.

To me, being commercial is not a crime.  I judge software on quality, ease of
use, access to source (if I need it), etc.  To me, TreeNet failed on several
of those criteria, but it works just fine for some people.
 
> I read some of the references mentioned earlier by
> helpers before I sent my first email. But I still
> appreciate any help. You guys are so nice!

That's no excuse for not following the posting guide, right?
 
> BTW, gbm means gradient boosting modeling :)

No.  I believe Greg calls it `generalized boosting models'.

Andy

 
> Ed
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
> 
>
