[R] memory limit in aov

Liaw, Andy andy_liaw at merck.com
Thu Feb 2 15:34:12 CET 2006


I don't know what the goal of the analysis is, but I have a suspicion that
the `gbm' package might be a more fruitful way...
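
(A minimal sketch of what such a fit could look like, assuming the data sit
in a data frame `seqdat' with the response in a column `fitness'; both names
are just placeholders:)

  library(gbm)                          # boosted regression trees
  fit <- gbm(fitness ~ .,               # all position factors as predictors
             data = seqdat,
             distribution = "gaussian",
             n.trees = 2000,
             interaction.depth = 2,     # trees deep enough for 2-way effects
             shrinkage = 0.01)
  best <- gbm.perf(fit, method = "OOB") # pick a sensible number of trees
  summary(fit, n.trees = best)          # relative influence of each position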

Cheers,
Andy

From: Lucy Crooks
> 
> Thanks for your reply.
> 
> Thanks for the info on aov() - I hadn't been able to tell which function
> to use from the help pages. There are no random effects, so I will
> switch to lm().
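> 
> (A minimal sketch of what that lm() call might look like, assuming the
> position factors are named p1 ... p405, p1 is the one that interacts
> with the rest, and the response column is `fitness'; all of these names
> are placeholders:)
> 
>   others <- paste("p", 2:405, sep = "")
>   fml <- as.formula(paste("fitness ~ p1 * (",
>                           paste(others, collapse = " + "), ")"))
>   fit <- lm(fml, data = seqdat)
>   anova(fit)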
> 
> The data are amino acid sequences: each factor is a position and its
> levels are the amino acids present at that position. There are indeed
> around 8 levels per position on average (ranging from 2 to 20). I don't
> think I can collapse the levels, at least to start with, as I don't
> know in advance which ones affect fitness (the y variable).
> 
> From what you say, R should be able to do the smaller analysis, so I
> have increased the RAM and will try this again.
> 
> Lucy Crooks
> 
> On Feb 1, 2006, at 3:45 PM, Peter Dalgaard wrote:
> > You do not want to use aov() on unbalanced data, and especially not
> > on large data sets if random effects are involved. Rather, you need
> > to look at lmer(), or just lm() if no random effects are present.
> >
> > However, even so, if you really have 29025 parameters to estimate, I
> > think you're out of luck. 8 billion (US) elements is 64G, and R is
> > not able to handle objects of that size - the limit is that the size
> > must fit in a 32-bit integer (about 2 billion elements).
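> >
> > (The arithmetic, checked in R itself:)
> >
> >   272992 * 29025             # elements in the full model matrix, ~7.9e9
> >   272992 * 29025 * 8         # bytes as doubles, about 6.3e10, i.e. ~63 GB
> >   .Machine$integer.max       # R's maximum vector length: 2147483647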
> >
> > A quick calculation suggests that your factors have around 8 levels
> > each. Is that really necessary, or can you perhaps collapse some
> > levels?
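> >
> > (If collapsing did become an option, something along these lines would
> > merge rarely seen amino acids at one position into an "other" level;
> > the column name `p17' and the cutoff of 10 observations are only
> > placeholders:)
> >
> >   x <- seqdat$p17
> >   rare <- names(which(table(x) < 10))        # rarely observed levels
> >   levels(x)[levels(x) %in% rare] <- "other"  # merge them into one level
> >   seqdat$p17 <- x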
> >
> >
> >
> > -- 
> >    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
> >   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
> >  (*) \(*) -- University of Copenhagen   Denmark          
> Ph:  (+45)  
> > 35327918
> > ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  
> FAX: (+45)  
> > 35327907
> 
> 
> > Lucy Crooks <Lucy.Crooks at env.ethz.ch> writes:
> >> I want to do an unbalanced anova on 272,992 observations with 405
> >> factors including 2-way interactions between 1 of these factors and
> >> the other 404. After fitting only 11 factors and their interactions
> >> I get error messages like:
> >>
> >> Error: cannot allocate vector of size 1433066 Kb
> >> R(365,0xa000ed68) malloc: *** vm_allocate(size=1467461632) failed
> >> (error code=3)
> >> R(365,0xa000ed68) malloc: *** error: can't allocate region
> >> R(365,0xa000ed68) malloc: *** set a breakpoint in szone_error to  
> >> debug
> >>
> >> I think that the anova involves a matrix of 272,992 rows by 29025
> >> columns (using dummy variables) = 7,900 million elements. I realise
> >> this is a lot! Could I solve this if I had more RAM, or is it just
> >> too big?
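> >>
> >> (One way to get the column count without actually building the matrix,
> >> assuming the 405 position factors sit in columns 2:406 of the data
> >> frame `seqdat', with the interacting factor first:)
> >>
> >>   p <- sapply(seqdat[2:406], nlevels)     # levels at each position
> >>   main <- 1 + sum(p - 1)                  # intercept + main effects
> >>   main + (p[1] - 1) * sum(p[-1] - 1)      # plus the 2-way interactions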
> >>
> >> Another possibility is to do 16 separate analyses on 17,062
> >> observations with 404 factors (although statistically I think the
> >> first approach is preferable). I get similar error messages then:
> >>
> >> Error: cannot allocate vector of size 175685 Kb
> >> R(365,0xa000ed68) malloc: *** vm_allocate(size=179904512) failed
> >> (error code=3)
> >>
> >> I think this analysis requires a 31 million element matrix.
> >>
> >> I am using R version 2.2.1 on a Mac G5 with 1 GB RAM running OS
> >> 10.4.4. Can somebody tell me what the limitations of my machine (or
> >> of R) are likely to be, whether this smaller analysis is feasible,
> >> and if so how much more memory I might require?
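> >>
> >> (Rough arithmetic for the smaller case, taking the 31-million-element
> >> figure at face value; lm() will need a few working copies of this:)
> >>
> >>   31e6 * 8 / 2^20   # one copy of the matrix as doubles: about 237 MB
> >>   gc()              # reports current and maximum memory used, in Mb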
> >>
> >> The data are in R in a data frame of 272,992 rows by 406 columns. I
> >> would really appreciate any helpful input.
> >>
> 
