[R] Cforest and Random Forest memory use
Raubertas, Richard
richard_raubertas at merck.com
Fri Jun 18 01:15:08 CEST 2010
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Max Kuhn
> Sent: Monday, June 14, 2010 10:19 AM
> To: Matthew OKane
> Cc: r-help at r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
> The first thing that I would recommend is to avoid the "formula
> interface" to models. The internals that R uses to create matrices
> from a formula + data set are not efficient. If you had a large number
> of variables, I would have automatically pointed to that as a source
> of issues. cforest and ctree only have formula interfaces though, so
> you are stuck on that one. The randomForest package has both
> interfaces, so that might be better.
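For what it's worth, a minimal sketch of what the x/y (non-formula) interface to
randomForest looks like; the object and column names ('example', 'badflag') follow
the thread and are assumptions:

  library(randomForest)

  ## pass predictors and response directly, bypassing the formula machinery
  x <- example[, setdiff(names(example), "badflag")]
  y <- factor(example$badflag)

  fit <- randomForest(x = x, y = y,
                      ntree = 500, mtry = 10, replace = FALSE)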
>
> Probably the issue is the depth of the trees. With that many
> observations, you are likely to get extremely deep trees. You might
> try limiting the depth of the tree and see if that has an effect on
> performance.
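A sketch of the tree-size knobs being referred to (argument names as documented in
randomForest and party; whether cforest_control() itself forwards a depth limit
depends on the party version, so only ctree is shown; 'example' and 'badflag' are
assumed names):

  library(randomForest)
  library(party)

  x <- example[, setdiff(names(example), "badflag")]
  y <- factor(example$badflag)

  ## randomForest: larger terminal nodes and a cap on node count keep trees shallower
  fit.rf <- randomForest(x = x, y = y, ntree = 500,
                         nodesize = 50,    # at least 50 observations per terminal node
                         maxnodes = 256)   # at most 256 terminal nodes per tree

  ## party: a single conditional inference tree can be capped directly via maxdepth
  fit.ct <- ctree(badflag ~ ., data = example,
                  controls = ctree_control(maxdepth = 8))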
>
> We run into these issues with large compound libraries; in those cases
> we do whatever we can to avoid ensembles of trees or kernel methods.
> If you want those, you might need to write your own code that is
> hyper-efficient and tuned to your particular data structure (as we
> did).
>
> On another note... are this many observations really needed? You have
> 40ish variables; I suspect that >1M points are pretty densely packed
> into 40-dimensional space.
This did not seem right to me: 40-dimensional space is very, very big
and even a million observations will be thinly spread. There is probably
some analytic result from the theory of coverage processes about this,
but I just did a quick simulation. If a million samples are independently
and randomly distributed in a 40-d unit hypercube, then >90% of the points
in the hypercube will be more than one-quarter of the maximum possible
distance (sqrt(40)) from the nearest sample. And about 40% of the hypercube
will be more than one-third of the maximum possible distance to the nearest
sample. So the samples do not densely cover the space at all.
One implication is that modeling the relation of a response to 40 predictors
will inevitably require a lot of smoothing, even with a million data points.
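A scaled-down R sketch of the kind of simulation described above (the details are a
reconstruction, not the original code; with 1e5 reference points instead of 1e6 the
nearest-neighbour distances only grow, so the empty fractions come out at least as
large as the figures quoted):

  set.seed(1)
  d     <- 40
  n.ref <- 1e5        # stand-in for the million "data" points
  n.qry <- 1000       # random probe points in the unit hypercube

  ref  <- matrix(runif(n.ref * d), nrow = n.ref)
  tref <- t(ref)      # d x n.ref, so squared distances can be taken column-wise
  qry  <- matrix(runif(n.qry * d), nrow = n.qry)

  ## distance from each probe point to its nearest sample
  nn.dist <- apply(qry, 1, function(q) sqrt(min(colSums((tref - q)^2))))

  max.dist <- sqrt(d)              # diameter of the unit hypercube
  mean(nn.dist > max.dist / 4)     # est. fraction of the cube beyond 1/4 of max
  mean(nn.dist > max.dist / 3)     # est. fraction beyond 1/3 of max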
Richard Raubertas
Merck & Co.
> Do you lose much by sampling the data set
> or allocating a large portion to a test set? If you have thousands of
> predictors, I could see the need for so many observations, but I'm
> wondering if many of the samples are redundant.
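In case it is useful, a sketch of the sort of subsample + hold-out split being
suggested ('example' is the poster's data frame, an assumed name; the 100k row
count is arbitrary):

  set.seed(42)
  idx   <- sample(nrow(example), size = 1e5)  # e.g. model on 100k rows
  train <- example[idx, ]
  test  <- example[-idx, ]                    # remainder held out for testing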
>
> Max
>
> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
> <mlokane at gmail.com> wrote:
> > Answers added below.
> > Thanks again,
> > Matt
> >
> > On 11 June 2010 14:28, Max Kuhn <mxkuhn at gmail.com> wrote:
> >>
> >> Also, you have not said:
> >>
> >> - your OS: Windows Server 2003 64-bit
> >> - your version of R: 2.11.1 64-bit
> >> - your version of party: 0.9-9995
> >
> >
> >>
> >> - your code:
> >>   test.cf <- cforest(formula = badflag ~ ., data = example,
> >>                      control = cforest_control(teststat = 'max',
> >>                          testtype = 'Teststatistic', replace = FALSE,
> >>                          ntree = 500, savesplitstats = FALSE, mtry = 10))
> >
> >> - what "Large data set" means: >1 million observations, 40+ variables,
> >>   around 200MB
> >> - what "very large model objects" means - anything which breaks
> >>
> >> So... how is anyone supposed to help you?
> >>
> >> Max
> >
> >
>
>
>
> --
>
> Max
>