[R] R versus SAS: lm performance

Arne.Muller@aventis.com Arne.Muller at aventis.com
Tue May 11 19:50:12 CEST 2004


Thanks, all, for your help. There seems to be a lot I can try to speed up the fits. However, I'd like to go for a much simpler model, which I think is justified by the experiment itself; e.g., I may think about removing the nesting "(Ba:Ti:Do)/Ar".

The model matrix has 1344 rows and 2970 columns, and the rank of the matrix is 504. Therefore I think I should reformulate the model.
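
As a quick check (a rough sketch, assuming the data frame is called df and contains the factor columns used below), the model matrix of a candidate model can be built directly and its dimensions and numerical rank inspected before fitting anything:

## design matrix for a candidate model, without fitting it
mm <- model.matrix(~ Ba + Ti + Do + Pr + (Ba:Ti:Do)/Ar, data = df)
dim(mm)        # rows x columns of the design matrix
qr(mm)$rank    # numerical rank; a rank far below ncol(mm) signals heavy aliasing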

I was just struck by the massive difference in performance when my colleague told me about the difference between SAS and S+.

	kind regards,

	Arne

--
Arne Muller, Ph.D.
Toxicogenomics, Aventis Pharma
arne dot muller domain=aventis com

> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com]
> Sent: 11 May 2004 14:20
> To: Muller, Arne PH/FR; ripley at stats.ox.ac.uk
> Cc: r-help at stat.math.ethz.ch
> Subject: RE: [R] R versus SAS: lm performance
> 
> 
> I tried the following on an Opteron 248, R-1.9.0 w/Goto's BLAS:
> 
> > y <- matrix(rnorm(14000*1344), 1344)
> > x <- matrix(runif(1344*503),1344)
> > system.time(fit <- lm(y~x))
> [1] 106.00  55.60 265.32   0.00   0.00
> 
> The resulting fit object is over 600MB.  (The coefficient component is a
> 504 x 14000 matrix.)
> 
> If I'm not mistaken, SAS sweeps on the extended cross-product matrix to fit
> regression models.  That, I believe, is usually faster than doing a QR
> decomposition on the model matrix itself, but there are trade-offs.  You
> could try what Prof. Bates suggested.
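> 
> For what it's worth, a rough illustration of the two routes (my own sketch,
> not what SAS actually does), reusing the x and y matrices above:
> 
> ## normal-equations / cross-product route
> xtx  <- crossprod(x)            # t(x) %*% x
> xty  <- crossprod(x, y)         # t(x) %*% y
> b_ne <- solve(xtx, xty)         # fast, but less stable if x is ill-conditioned
> ## QR route, essentially what lm.fit does internally
> b_qr <- qr.coef(qr(x), y)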
> 
> Andy
> 
> > From: Arne.Muller at aventis.com
> > 
> > Hello,
> > 
> > thanks for your reply. I've now done the profiling, and I interpret
> > this as most of the time being spent in the Fortran routine(s):
> > 
> > Each sample represents 0.02 seconds.
> > Total run time: 920.219999999453 seconds.
> > 
> > Total seconds: time spent in function and callees.
> > Self seconds: time spent in function alone.
> > 
> >    %       total       %       self
> >  total    seconds     self    seconds    name
> > 100.00    920.22      0.02      0.16     "lm"
> >  99.96    919.88      0.10      0.88     "lm.fit"
> >  99.74    917.84     99.74    917.84     ".Fortran"
> >   0.07      0.66      0.02      0.14     "storage.mode<-"
> >   0.06      0.52      0.00      0.00     "eval"
> >   0.06      0.52      0.04      0.34     "as.double"
> >   0.02      0.22      0.02      0.22     "colnames<-"
> >   0.02      0.20      0.02      0.20     "structure"
> >   0.02      0.18      0.02      0.18     "model.matrix.default"
> >   0.02      0.18      0.02      0.18     "as.double.default"
> >   0.02      0.18      0.00      0.00     "model.matrix"
> >   0.01      0.08      0.01      0.08     "list"
> > 
> >    %       self        %       total
> >  self     seconds    total    seconds    name
> >  99.74    917.84     99.74    917.84     ".Fortran"
> >   0.10      0.88     99.96    919.88     "lm.fit"
> >   0.04      0.34      0.06      0.52     "as.double"
> >   0.02      0.22      0.02      0.22     "colnames<-"
> >   0.02      0.20      0.02      0.20     "structure"
> >   0.02      0.18      0.02      0.18     "as.double.default"
> >   0.02      0.18      0.02      0.18     "model.matrix.default"
> >   0.02      0.16    100.00    920.22     "lm"
> >   0.02      0.14      0.07      0.66     "storage.mode<-"
> >   0.01      0.08      0.01      0.08     "list"
> > 
> > I guess this actually means I cannot do anything about it ... 
> > other than maybe splitting the problem into independent parts
> > (which I actually may be able to do).
> > 
> > Regarding the use of lm.fit instead of lm, this might be a 
> > good idea, since I am using the same model matrix for all 
> > fits! However, I'd need to recreate an lm object from the 
> > output, because I'd like to run the anova function on it. 
> > I'll first do some profiling on lm versus lm.fit for the 
> > 12,000 models ...
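> > 
> > Concretely, something like the following (a rough sketch only; expr.mat
> > here is a hypothetical 1344 x 12000 matrix of responses, one column per
> > gene) would factor the shared design once and reuse it for every fit:
> > 
> > X  <- model.matrix(wlm)          # design matrix from one fitted model
> > qx <- qr(X)                      # factor it once
> > coefs  <- qr.coef(qx, expr.mat)  # coefficients for every gene at once
> > resids <- qr.resid(qx, expr.mat) # residuals, e.g. for sums of squares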
> > 
> > 	kind regards + thanks again for your help,
> > 
> > 	Arne
> > 
> > --
> > Arne Muller, Ph.D.
> > Toxicogenomics, Aventis Pharma
> > arne dot muller domain=aventis com
> > 
> > > -----Original Message-----
> > > From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk]
> > > Sent: 11 May 2004 09:08
> > > To: Muller, Arne PH/FR
> > > Cc: r-help at stat.math.ethz.ch
> > > Subject: Re: [R] R versus SAS: lm performance
> > > 
> > > 
> > > The way to time things in R is system.time().
> > > 
> > > Without knowing much more about your problem we can only guess where R
> > > is spending the time.  But you can find out by profiling -- see
> > > `Writing R Extensions'.
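> > > 
> > > For example (just a sketch; the output file name is arbitrary, and the
> > > simple model below is only a placeholder):
> > > 
> > > Rprof("lm-profile.out")                  # start the profiler
> > > fit <- lm(Va ~ Ba * Ti * Do, data = df)  # substitute the real model here
> > > Rprof(NULL)                              # stop the profiler
> > > summaryRprof("lm-profile.out")           # or run R CMD Rprof on the file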
> > > 
> > > If you want multiple fits with the same design matrix (do you?) you
> > > could look at the code of lm and call lm.fit repeatedly yourself.
> > > 
> > > On Mon, 10 May 2004 Arne.Muller at aventis.com wrote:
> > > 
> > > > Hello,
> > > > 
> > > > A colleague of mine has compared the runtime of a linear model + anova
> > > > in SAS and S+. He got the same results, but SAS took a bit more than a
> > > > minute whereas S+ took 17 minutes. I've tried it in R (1.9.0) and it
> > > > took 15 min. Neither machine ran out of memory, and I assume that all
> > > > machines have similar hardware, but the S+ and SAS machines are on
> > > > Windows whereas the R machine runs Redhat Linux 7.2.
> > > > 
> > > > My question is whether I'm doing something wrong (technically) in
> > > > calling the lm routine, or (if not), how I can optimize the call to lm
> > > > or even use an alternative to lm. I'd like to run about 12,000 of these
> > > > models in R (for a gene expression experiment - one model per gene),
> > > > which would take far too long.
> > > > 
> > > > I've run the following code in R (and S+):
> > > > 
> > > > > options(contrasts=c('contr.helmert', 'contr.poly'))
> > > > 
> > > > The 1st column is the value to be modeled, and the others are factors.
> > > > 
> > > > > names(df.gene1data) <- c("Va", "Ba", "Ti", "Do", "Ar", "Pr")
> > > > > df[c(1:2,1343:1344),]
> > > >            Va    Do  Ti  Ba Ar    Pr
> > > > 1    2.317804 000mM 24h NEW  1     1
> > > > 2    2.495390 000mM 24h NEW  2     1
> > > > 8315 2.979641 025mM 04h PRG 83    16
> > > > 8415 4.505787 000mM 04h PRG 84    16
> > > > 
> > > > This is a data frame with 1344 rows.
> > > > 
> > > > x <- Sys.time();
> > > > wlm <- lm(Va ~ Ba+Ti+Do+Pr+Ba:Ti+Ba:Do+Ba:Pr+Ti:Do+Ti:Pr+Do:Pr+
> > > >                Ba:Ti:Do+Ba:Ti:Pr+Ba:Do:Pr+Ti:Do:Pr+Ba:Ti:Do:Pr+
> > > >                (Ba:Ti:Do)/Ar, data=df, singular=T);
> > > > difftime(Sys.time(), x)
> > > > 
> > > > Time difference of 15.33333 mins
> > > > 
> > > > > anova(wlm)
> > > > Analysis of Variance Table
> > > > 
> > > > Response: Va
> > > >              Df Sum Sq Mean Sq   F value    Pr(>F)    
> > > > Ba            2    0.1     0.1    0.4262  0.653133    
> > > > Ti            1    2.6     2.6   16.5055 5.306e-05 ***
> > > > Do            4    6.8     1.7   10.5468 2.431e-08 ***
> > > > Pr           15 5007.4   333.8 2081.8439 < 2.2e-16 ***
> > > > Ba:Ti         2    3.2     1.6    9.8510 5.904e-05 ***
> > > > Ba:Do         7    2.8     0.4    2.5054  0.014943 *  
> > > > Ba:Pr        30   80.6     2.7   16.7585 < 2.2e-16 ***
> > > > Ti:Do         4    8.7     2.2   13.5982 9.537e-11 ***
> > > > Ti:Pr        15    2.4     0.2    1.0017  0.450876    
> > > > Do:Pr        60   10.2     0.2    1.0594  0.358551    
> > > > Ba:Ti:Do      7    1.4     0.2    1.2064  0.296415    
> > > > Ba:Ti:Pr     30    5.6     0.2    1.1563  0.259184    
> > > > Ba:Do:Pr    105   14.2     0.1    0.8445  0.862262    
> > > > Ti:Do:Pr     60   14.8     0.2    1.5367  0.006713 ** 
> > > > Ba:Ti:Do:Pr 105   15.8     0.2    0.9382  0.653134    
> > > > Ba:Ti:Do:Ar  56   26.4     0.5    2.9434 2.904e-11 ***
> > > > Residuals   840  134.7     0.2                        
> > > > 
> > > > The corresponding SAS program from my collegue is:
> > > > 
> > > > proc glm data = "the name of the data set";
> > > > 
> > > > class B T D A P;
> > > > 
> > > > model V = B T D P B*T B*D B*P T*D T*P D*P B*T*D B*T*P B*D*P
> > > >           T*D*P B*T*D*P A(B*T*D);
> > > > 
> > > > run;
> > > > 
> > > > Note: V = Va, B = Ba, T = Ti, D = Do, P = Pr, A = Ar in the R example.
> > > 
> > > -- 
> > > Brian D. Ripley,                  ripley at stats.ox.ac.uk
> > > Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> > > University of Oxford,             Tel:  +44 1865 272861 (self)
> > > 1 South Parks Road,                     +44 1865 272866 (PA)
> > > Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> > > 
> > >
> > 
> > 
> > 
> 
> 



