[R] equipment

Peter Dalgaard BSA p.dalgaard at biostat.ku.dk
Wed Apr 23 20:34:15 CEST 2003

"Ruud H. Koning" <info at rhkoning.com> writes:

> Hello, it is likely that I will have to analyze a rather sizeable dataset:
> 60000 records, 10 to 15 variables. I will have to make descriptive
> statistics, and estimate linear models, glm's and maybe Cox proportional
> hazard model with time varying covariates. In theory, this is possible in
> R, but I would like to get some feedback on the equipment I should get for
> this. At this moment, I have a Pentium 3 laptop running windows 2000 with
> 384MB ram. What type of cpu-speed and/or how much memory should I get?
> Thanks for some ideas, Ruud

Except for the time-varying Cox thing, this doesn't seem too hard:

> d <- as.data.frame(matrix(rnorm(60000*15),60000,15))
> names(d)
 [1] "V1"  "V2"  "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10" "V11" "V12"
[13] "V13" "V14" "V15"
> system.time(lm(V15~.,data=d))
[1] 2.62 0.61 3.24 0.00 0.00
> gc()
          used (Mb) gc trigger (Mb)
Ncells  431614 11.6     741108 19.8
Vcells 1079809  8.3    6817351 52.1

That's on the fastest machine I have access to, a 2.8 GHz Xeon (dual,
but without a threaded BLAS library). It is about three times slower on
a 900 MHz PIII. For a GLM you'll do similar operations iterated, say, 5
times, and if you have factors and interactions among your predictors,
the cost increases roughly in proportion to the number of parameters in
the model.
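
To make the GLM remark concrete, here is a rough sketch on the same kind of simulated data. The binary response y and the factor f are made up purely for illustration (they are not part of the original question), and timings will of course depend on the machine:

```r
set.seed(1)
n <- 60000
d <- as.data.frame(matrix(rnorm(n * 15), n, 15))

# Hypothetical binary response, generated only for this illustration
d$y <- rbinom(n, 1, plogis(0.5 * d$V1))

# One least-squares fit versus an iterated (IRLS) logistic fit
t_lm  <- system.time(lm(V15 ~ . - y, data = d))
t_glm <- system.time(fit <- glm(y ~ ., data = d, family = binomial))
print(t_lm["elapsed"])
print(t_glm["elapsed"])
print(fit$iter)   # number of IRLS iterations glm() actually used

# Factors and interactions enlarge the design matrix:
# a 5-level factor crossed with one numeric predictor
d$f <- factor(rep(letters[1:5], length.out = n))
X <- model.matrix(~ f * V1, data = d)
print(ncol(X))    # intercept + 4 dummies + V1 + 4 interaction columns
```

The column count of the model matrix is what drives the cost of each least-squares step, so checking ncol(model.matrix(...)) before fitting gives a quick idea of how much a richer model will cost.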

Time-dependent Cox in full generality has complexity proportional to
the square of the data set size (one regression computation per death),
which could be prohibitive, but there are often simplifications,
depending on the nature of the time dependency.
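
One such simplification, assuming each covariate changes only at a few known times per subject, is to split every subject's follow-up into (start, stop] intervals and fit with the counting-process form of Surv() in the survival package. The sketch below uses invented data (a single 0/1 covariate x that switches on at a random time), so it only shows the data layout, not anything about the real study:

```r
library(survival)

set.seed(1)
n <- 200
time      <- rexp(n, 0.1)        # follow-up time (simulated)
status    <- rbinom(n, 1, 0.7)   # 1 = death, 0 = censored (simulated)
switch_at <- runif(n, 0, 5)      # hypothetical time at which x goes 0 -> 1

# Interval before the switch (or the whole follow-up if no switch occurs)
before <- data.frame(id    = seq_len(n),
                     start = 0,
                     stop  = pmin(time, switch_at),
                     event = ifelse(time <= switch_at, status, 0),
                     x     = 0)

# Interval after the switch, only for subjects still under observation
sw <- time > switch_at
after <- data.frame(id    = which(sw),
                    start = switch_at[sw],
                    stop  = time[sw],
                    event = status[sw],
                    x     = 1)

long <- rbind(before, after)
long <- long[order(long$id, long$start), ]

# Counting-process form: each row is one at-risk interval
fit <- coxph(Surv(start, stop, event) ~ x, data = long)
print(fit)
```

The long data set here has at most two rows per subject, so it grows linearly with the number of covariate changes rather than needing one expanded data set per death time.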

   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
