[R] R memory and CPU requirements
Deepayan Sarkar
deepayan at stat.wisc.edu
Fri Oct 17 16:53:38 CEST 2003
On Friday 17 October 2003 03:33, Alexander Sirotkin [at Yahoo] wrote:
> > > > > One more (hopefully last one) : I've been very
> > > > > surprised when I tried to fit a model (using
> > > > > aov())
> > > > > for a sample of size 200 and 10 variables and
> > > > > their interactions.
> > > >
> > > > That doesn't really say much. How many of these
> > > > variables are factors? How many levels do they
> > > > have? And what is the order of the interaction?
> > > > (Note that for 10 numeric variables, if you allow
> > > > all interactions, there will be 100 terms in your
> > > > model. This increases for factors.)
> > > >
> > > > In other words, how big is your model matrix ?
> > >
> > > I see...
> > >
> > > Unfortunately, model.matrix() ran out of memory :)
> > > I have 10 variables, 6 of which are factors, 2 of
> > > which have quite a lot of levels (about 40). And I
> > > would like to allow all interactions.
> > >
> > > I understand your point about categorical variables,
> > > but still - this does not seem like too much data
> > > to me.
> >
> > That's one way to look at it. You don't have enough
> > data for the model you are
> > trying to fit. The usual approach under these
> > circumstances is to try
> > 'simpler' models.
> >
> > Please try to understand what you are trying to do
> > (in this case by reading an
> > introductory linear model text) before blindly
> > applying a methodology.
> >
> > Deepayan
>
> I did study ANOVA and I do have enough observations.
> 200 was only a random sample of more than 5000, which I
> think should be enough. However, I'm afraid to even
> think about the amount of RAM I will need in R to fit a
> model to this data.
Let's see. You have 10 variables, 6 of which are factors, 2 of which have at
least 40 levels, and you want all interactions. Let's conservatively assume
that the other four factors have only 2 levels each.
> x1 = gl(40, 1, 1)
> x2 = gl(40, 1, 1)
> x3 = gl(2, 1, 1)
> x4 = gl(2, 1, 1)
> x5 = gl(2, 1, 1)
> x6 = gl(2, 1, 1)
> dim(model.matrix(~ x1 * x2 * x3 * x4 * x5 * x6))
[1] 1 25600
This was for one data point; increasing the sample size only increases the
number of rows, while the number of columns stays the same. And of course,
this counts only the six factors (up to their 6-way interaction), which is
the smallest figure possible given the information you have given us about
your model. In actual fact, your model matrix will have many, many more
columns.
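To get a rough sense of the memory involved, here is a back-of-the-envelope
sketch (my own illustration, assuming a numeric model matrix stored as
doubles at 8 bytes per entry, and counting only the 25600 columns above):

```r
# Back-of-the-envelope memory estimate for the full-interaction model
# matrix: 5000 rows (the full sample), 25600 columns (the conservative
# lower bound computed above), 8 bytes per double.
n_rows <- 5000
n_cols <- 40 * 40 * 2 * 2 * 2 * 2   # 25600
bytes  <- n_rows * n_cols * 8
bytes / 1024^3                      # roughly 0.95 GB for this matrix alone
```

And that is just the matrix itself, before lm()/aov() make any working
copies during the fit.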
I hope you realize that the number of columns in the model matrix is the
number of parameters you are trying to estimate. If your sample size is less
than this number (and 5000 is way less), then there will be infinitely many
solutions to this problem, each of which will fit your data perfectly. Do you
really want such an answer? Assuming that you find one, what are you
going to do with it?
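The problem is easy to demonstrate at toy scale (hypothetical data, not
yours): with more coefficients than observations, lm() simply cannot
estimate them all, and the surplus coefficients come back as NA because the
fit is rank-deficient.

```r
# Toy illustration: 5 observations, 10 numeric predictors, hence 11
# coefficients including the intercept. The design has rank at most 5,
# so lm() reports the remaining coefficients as NA (aliased).
set.seed(1)
y   <- rnorm(5)
x   <- matrix(rnorm(5 * 10), nrow = 5)
fit <- lm(y ~ x)
sum(is.na(coef(fit)))   # 6 of the 11 coefficients are inestimable
```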
I have no idea what made you choose such a high-order model, but as Andy has
said, you really should try to figure out what exactly your goals are before
proceeding. If you believe that your data really cannot be modeled
reasonably by anything simpler, you probably should not use a linear model at
all.
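For comparison, a sketch of how much smaller the model matrix becomes if you
stop at, say, two-way interactions (same hypothetical level structure as
above: two 40-level factors, four 2-level factors):

```r
# Hypothetical comparison of column counts, using the conservative
# level structure from the earlier example.
x1 <- gl(40, 1, 100); x2 <- gl(40, 1, 100)
x3 <- gl(2, 1, 100);  x4 <- gl(2, 1, 100)
x5 <- gl(2, 1, 100);  x6 <- gl(2, 1, 100)

# Main effects only: 1 + 39 + 39 + 4 = 83 columns
ncol(model.matrix(~ x1 + x2 + x3 + x4 + x5 + x6))

# All two-way interactions as well: 1922 columns
ncol(model.matrix(~ (x1 + x2 + x3 + x4 + x5 + x6)^2))
```

Even with every two-way interaction included, this matrix has fewer columns
than your 5000 observations, so a model of that order is at least
arithmetically estimable; whether it is sensible is a separate question.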
Hope that helps,
Deepayan