[R] Large data sets and memory management in R.
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Wed Jan 28 22:18:39 CET 2004
gerald.jean at dgag.ca writes:
> library(package = "statmod", pos = 2,
> lib.loc = "/home/jeg002/R-1.8.1/lib/R/R_LIBS")
>
> qc.B3.tweedie <- glm(formula = pp20B3 ~ ageveh + anpol +
> categveh + champion + cie + dossiera +
> faq13c + faq5a + kmaff + kmprom + nbvt +
> rabprof + sexeprin + newage,
> family = tweedie(var.power = 1.577,
> link.power = 0),
> etastart = log(rep(mean(qc.b3.sans.occ[,
> 'pp20B3']), nrow(qc.b3.sans.occ))),
> weights = unsb3t1,
> trace = T,
> data = qc.b3.sans.occ)
>
> After one iteration (45+ minutes) R is thrashing through over 10Gb of
> memory.
>
> Thanks for any insights,
Well, I don't know how much this helps; you are in somewhat uncharted
territory there. I suppose the dataset comes to 0.5-1GB all by itself?
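A rough check of its footprint, assuming the data frame really is called
qc.b3.sans.occ as in your call, would be something like

    ## size of the data frame itself, in gigabytes
    as.numeric(object.size(qc.b3.sans.occ)) / 1024^3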
One thing I note is that you have 60 variables but use only 15.
Perhaps it would help to drop the unused ones before the run?
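Only a sketch, with the variable names copied from your formula and
weights argument (qc.b3.small is just an illustrative name):

    ## keep only the columns that actually enter the fit
    vars <- c("pp20B3", "ageveh", "anpol", "categveh", "champion", "cie",
              "dossiera", "faq13c", "faq5a", "kmaff", "kmprom", "nbvt",
              "rabprof", "sexeprin", "newage", "unsb3t1")
    qc.b3.small <- qc.b3.sans.occ[, vars]

and then pass data = qc.b3.small to glm().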
How large does the design matrix get? If some of those variables have a
lot of levels, it could explain the phenomenon. Any chance that a
continuous variable got recorded as a factor?
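A sketch to check both points, with the formula copied from your call;
running model.matrix() on a small slice keeps the check itself cheap
(factor levels are kept when subsetting, so the column count matches the
full fit):

    ## how many columns the design matrix will have, checked on 1000 rows
    dim(model.matrix(~ ageveh + anpol + categveh + champion + cie + dossiera +
                       faq13c + faq5a + kmaff + kmprom + nbvt + rabprof +
                       sexeprin + newage,
                     data = qc.b3.sans.occ[1:1000, ]))

    ## levels per factor; a continuous variable stored as a factor
    ## shows up with an enormous level count
    sapply(qc.b3.sans.occ, function(x) if (is.factor(x)) nlevels(x) else NA)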
-p
--
O__ ---- Peter Dalgaard Blegdamsvej 3
c/ /'_ --- Dept. of Biostatistics 2200 Cph. N
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907