[R] Cforest and Random Forest memory use
Raubertas, Richard
richard_raubertas at merck.com
Fri Jun 18 01:15:08 CEST 2010
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Max Kuhn
> Sent: Monday, June 14, 2010 10:19 AM
> To: Matthew OKane
> Cc: r-help at r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
> The first thing that I would recommend is to avoid the "formula
> interface" to models. The internals that R uses to create matrices
> from a formula + data set are not efficient. If you had a large number
> of variables, I would have automatically pointed to that as a source
> of issues. cforest and ctree only have formula interfaces though, so
> you are stuck on that one. The randomForest package has both
> interfaces, so that might be better.
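For what it's worth, a minimal sketch of what the x/y (non-formula) interface to
randomForest looks like; the object and column names ('example', 'badflag') follow
the thread and are assumptions:

  library(randomForest)

  ## pass predictors and response directly, bypassing the formula machinery
  x <- example[, setdiff(names(example), "badflag")]
  y <- factor(example$badflag)

  fit <- randomForest(x = x, y = y,
                      ntree = 500, mtry = 10, replace = FALSE)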
>
> Probably the issue is the depth of the trees. With that many
> observations, you are likely to get extremely deep trees. You might
> try limiting the depth of the tree and see if that has an effect on
> performance.
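A sketch of the tree-size knobs being referred to (argument names as documented in
randomForest and party; whether cforest_control() itself forwards a depth limit
depends on the party version, so only ctree is shown; 'example' and 'badflag' are
assumed names):

  library(randomForest)
  library(party)

  x <- example[, setdiff(names(example), "badflag")]
  y <- factor(example$badflag)

  ## randomForest: larger terminal nodes and a cap on node count keep trees shallower
  fit.rf <- randomForest(x = x, y = y, ntree = 500,
                         nodesize = 50,    # at least 50 observations per terminal node
                         maxnodes = 256)   # at most 256 terminal nodes per tree

  ## party: a single conditional inference tree can be capped directly via maxdepth
  fit.ct <- ctree(badflag ~ ., data = example,
                  controls = ctree_control(maxdepth = 8))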
>
> We run into these issues with large compound libraries; in those cases
> we do whatever we can to avoid ensembles of trees or kernel methods.
> If you want those, you might need to write your own code that is
> hyper-efficient and tuned to your particular data structure (as we
> did).
>
> On another note... are this many observations really needed? You have
> 40ish variables; I suspect that >1M points are pretty densely packed
> into 40-dimensional space.
This did not seem right to me: 40-dimensional space is very, very big
and even a million observations will be thinly spread. There is probably
some analytic result from the theory of coverage processes about this,
but I just did a quick simulation. If a million samples are independently
and randomly distributed in a 40-d unit hypercube, then >90% of the points
in the hypercube will be more than one-quarter of the maximum possible
distance (sqrt(40)) from the nearest sample. And about 40% of the hypercube
will be more than one-third of the maximum possible distance to the nearest
sample. So the samples do not densely cover the space at all.
One implication is that modeling the relation of a response to 40 predictors
will inevitably require a lot of smoothing, even with a million data points.
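A scaled-down R sketch of the kind of simulation described above (the details are a
reconstruction, not the original code; with 1e5 reference points instead of 1e6 the
nearest-neighbour distances only grow, so the empty fractions come out at least as
large as the figures quoted):

  set.seed(1)
  d     <- 40
  n.ref <- 1e5        # stand-in for the million "data" points
  n.qry <- 1000       # random probe points in the unit hypercube

  ref  <- matrix(runif(n.ref * d), nrow = n.ref)
  tref <- t(ref)      # d x n.ref, so squared distances can be taken column-wise
  qry  <- matrix(runif(n.qry * d), nrow = n.qry)

  ## distance from each probe point to its nearest sample
  nn.dist <- apply(qry, 1, function(q) sqrt(min(colSums((tref - q)^2))))

  max.dist <- sqrt(d)              # diameter of the unit hypercube
  mean(nn.dist > max.dist / 4)     # est. fraction of the cube beyond 1/4 of max
  mean(nn.dist > max.dist / 3)     # est. fraction beyond 1/3 of max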
Richard Raubertas
Merck & Co.
> Do you lose much by sampling the data set
> or allocating a large portion to a test set? If you have thousands of
> predictors, I could see the need for so many observations, but I'm
> wondering if many of the samples are redundant.
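In case it is useful, a sketch of the sort of subsample + hold-out split being
suggested ('example' is the poster's data frame, an assumed name; the 100k row
count is arbitrary):

  set.seed(42)
  idx   <- sample(nrow(example), size = 1e5)  # e.g. model on 100k rows
  train <- example[idx, ]
  test  <- example[-idx, ]                    # remainder held out for testing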
>
> Max
>
> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane
> <mlokane at gmail.com> wrote:
> > Answers added below.
> > Thanks again,
> > Matt
> >
> > On 11 June 2010 14:28, Max Kuhn <mxkuhn at gmail.com> wrote:
> >>
> >> Also, you have not said:
> >>
> >> - your OS: Windows Server 2003 64-bit
> >> - your version of R: 2.11.1 64-bit
> >> - your version of party: 0.9-9995
> >
> >
> >>
> >> - your code:
> >>   test.cf <- cforest(formula = badflag ~ ., data = example,
> >>                      control = cforest_control(teststat = 'max',
> >>                          testtype = 'Teststatistic', replace = FALSE,
> >>                          ntree = 500, savesplitstats = FALSE, mtry = 10))
> >
> >> - what "Large data set" means: >1 million observations, 40+ variables,
> >>   around 200MB
> >> - what "very large model objects" means - anything which breaks
> >>
> >> So... how is anyone supposed to help you?
> >>
> >> Max
> >
> >
>
>
>
> --
>
> Max
>