[R] Cforest and Random Forest memory use

Bert Gunter gunter.berton at gene.com
Fri Jun 18 17:41:31 CEST 2010


Rich is right, of course. One way to think about it is this (paraphrased from
the section on the "Curse of Dimensionality" in Hastie et al.'s "The Elements
of Statistical Learning"): suppose 10 uniformly distributed points on a line
give what you consider to be adequate coverage of the line. Then in 40
dimensions, you'd need 10^40 uniformly distributed points to give equivalent
coverage.
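
A quick back-of-the-envelope check of that arithmetic in R, using only the
numbers from the paragraph above:

  ## If n points give adequate coverage in 1 dimension, matching that
  ## sampling density in d dimensions requires n^d points.
  n <- 10    # points judged adequate on a line
  d <- 40    # number of dimensions
  n^d        # 1e+40 -- points needed for equivalent coverage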

Various other aspects of the curse of dimensionality are discussed in the
book, one of which is that in high dimensions most points are closer to the
boundaries than to each other. As Rich indicates, this has profound
implications for what one can sensibly do with such data. One example:
nearest-neighbor procedures don't make much sense (since nobody is likely to
have anybody else nearby), which Rich's little simulation nicely
demonstrated.
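
To make the boundary point concrete: the fraction of a d-dimensional unit
hypercube lying within a small distance eps of at least one face is
1 - (1 - 2*eps)^d, for example:

  ## Share of the unit hypercube within eps of some face
  eps <- 0.05
  d <- 40
  1 - (1 - 2 * eps)^d   # about 0.985 -- nearly everything sits near a boundary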

Cheers to all,

Bert Gunter
Genentech Nonclinical Statistics 



-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
Behalf Of Raubertas, Richard
Sent: Thursday, June 17, 2010 4:15 PM
To: Max Kuhn; Matthew OKane
Cc: r-help at r-project.org
Subject: Re: [R] Cforest and Random Forest memory use

 

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Max Kuhn
> Sent: Monday, June 14, 2010 10:19 AM
> To: Matthew OKane
> Cc: r-help at r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
> 
> The first thing that I would recommend is to avoid the "formula
> interface" to models. The internals that R uses to create matrices
> from a formula + data set are not efficient. If you had a large number
> of variables, I would have automatically pointed to that as a source
> of issues. cforest and ctree only have formula interfaces, though, so
> you are stuck on that one. The randomForest package has both
> interfaces, so that might be better.
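
For reference, the non-formula interface to randomForest looks roughly like
this, reusing the data frame 'example' and response 'badflag' from the code
further down (the exact column handling here is only illustrative):

  library(randomForest)

  ## The x/y interface skips the model.frame/model.matrix machinery
  ## that the formula interface has to build.
  x <- example[, setdiff(names(example), "badflag")]   # predictor columns
  y <- factor(example$badflag)                         # response as a factor (classification)
  fit <- randomForest(x = x, y = y, ntree = 500, mtry = 10)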
> 
> Probably the issue is the depth of the trees. With that many
> observations, you are likely to get extremely deep trees. You might
> try limiting the depth of the tree and see if that has an effect on
> performance.
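
In randomForest, for instance, tree size can be capped with the nodesize and
maxnodes arguments (the values below are arbitrary and would need tuning;
x and y as in the earlier sketch):

  ## Larger terminal nodes and a ceiling on the number of terminal nodes
  ## both keep trees shallow, which also shrinks the fitted object.
  fit_small <- randomForest(x = x, y = y, ntree = 500, mtry = 10,
                            nodesize = 1000,  # minimum observations per terminal node
                            maxnodes = 64)    # at most 64 terminal nodes per tree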
> 
> We run into these issues with large compound libraries; in those cases
> we do whatever we can to avoid ensembles of trees or kernel methods.
> If you want those, you might need to write your own code that is
> hyper-efficient and tuned to your particular data structure (as we
> did).
> 
> On another note... are this many observations really needed? You have
> 40ish variables; I suspect that >1M points are pretty densely packed
> into 40-dimensional space. 

This did not seem right to me:  40-dimensional space is very, very big
and even a million observations will be thinly spread.  There is probably 
some analytic result from the theory of coverage processes about this, 
but I just did a quick simulation.  If a million samples are independently 
and randomly distributed in a 40-d unit hypercube, then >90% of the points 
in the hypercube will be more than one-quarter of the maximum possible 
distance (sqrt(40)) from the nearest sample.  And about 40% of the hypercube
will be more than one-third of the maximum possible distance to the nearest
sample.  So the samples do not densely cover the space at all.
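
A rough sketch of that kind of simulation (not Richard's actual code, and with
a smaller reference sample than his million points so it runs quickly; the
fractions will therefore come out, if anything, even larger):

  set.seed(1)
  d <- 40
  n_sample <- 1e5                 # scaled down from 1e6
  n_query  <- 200                 # random test points in the hypercube
  samp_t <- t(matrix(runif(n_sample * d), ncol = d))   # d x n_sample, for fast column math
  query  <- matrix(runif(n_query * d), ncol = d)

  ## distance from each test point to its nearest sample
  nearest <- apply(query, 1, function(q) sqrt(min(colSums((samp_t - q)^2))))

  max_dist <- sqrt(d)             # length of the hypercube's main diagonal
  mean(nearest > max_dist / 4)    # Richard reported > 90% with 1e6 samples
  mean(nearest > max_dist / 3)    # Richard reported about 40%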

One implication is that modeling the relation of a response to 40 predictors
will inevitably require a lot of smoothing, even with a million data points.

Richard Raubertas
Merck & Co.

> Do you lose much by sampling the data set
> or allocating a large portion to a test set? If you have thousands of
> predictors, I could see the need for so many observations, but I'm
> wondering if many of the samples are redundant.
> 
> Max
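
Subsampling plus a held-out test set, as suggested above, can be as simple as
this (the 25% proportion is arbitrary):

  set.seed(42)
  n <- nrow(example)
  train_idx <- sample(n, size = floor(0.25 * n))   # fit on a 25% subsample
  train <- example[train_idx, ]
  test  <- example[-train_idx, ]                   # held out for assessment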
> 
> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane 
> <mlokane at gmail.com> wrote:
> > Answers added below.
> > Thanks again,
> > Matt
> >
> > On 11 June 2010 14:28, Max Kuhn <mxkuhn at gmail.com> wrote:
> >>
> >> Also, you have not said:
> >>
> >>  - your OS: Windows Server 2003 64-bit
> >>  - your version of R: 2.11.1 64-bit
> >>  - your version of party: 0.9-9995
> >
> >
> >>
> >>  - your code:
> >>
> >>    test.cf <- cforest(formula = badflag ~ ., data = example,
> >>                       control = cforest_control(teststat = 'max',
> >>                           testtype = 'Teststatistic', replace = FALSE,
> >>                           ntree = 500, savesplitstats = FALSE, mtry = 10))
> >
> >>  - what "Large data set" means: > 1 million observations, 40+ variables,
> >>    around 200MB
> >>  - what "very large model objects" means - anything which breaks
> >>
> >> So... how is anyone supposed to help you?
> >>
> >> Max
> >
> >
> 
> 
> 
> -- 
> 
> Max
> 

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


