[R] LM: Least Squares on Large Datasets OR why lm() is designed the way it is

Vadim Ogranovich vograno at arbitrade.com
Fri Aug 9 19:56:29 CEST 2002


I have always wondered why S-Plus/R cannot fit a linear model to an
arbitrarily large data set, given that, I thought, it should be pretty
straightforward. Some time ago I came across a reference to the LM package,
http://www.econ.uiuc.edu/~anovo/LM.html, by Roger Koenker and Alvaro Novo.
So I thought here it is at last, but to my surprise this project hasn't made
it into the recommended packages and its development seems to have stopped.
I take this as strong evidence that there is a conceptual problem in doing
this sort of thing, and I thought it would be very educational for me to
understand it.

Here is how I would structure an lm object; please feel free to point out
mistakes.

Suppose we want to analyze lm(Y ~ X), where Y is a vector and X is a matrix.
1. Under the classical assumptions of normality and independence of the
residuals, all information about the model is encapsulated in the covariance
matrix of [Y,X] and the observation count, i.e. length(Y) (plus the column
means of [Y,X] if the intercept or predictions are wanted). This includes
the variances of the coefficients, their significance levels, the ability to
compute predictions, etc. Moreover, all sub-models, i.e. regressions on any
subset of the columns of X, are also readily computable, as is ANOVA.
Given this, I'd store the covariance matrix of [Y,X] and the count on an lm
object and write the summary.lm, anova.lm, step and stepAIC functions in
terms of these two members only.
I guess this is the idea behind the LM package.
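
To make point 1 concrete, here is a minimal R sketch that recovers the
coefficients, their standard errors and t-tests from nothing but the
covariance matrix, the column means and the count. The function name
fit_from_moments and its arguments are made up for illustration; this is not
the LM package's interface. It assumes a model with an intercept and the
response in column 1 of [Y,X].

```r
# Hypothetical helper: OLS with intercept from summary statistics only.
# S: covariance matrix of [Y, X]; m: column means of [Y, X]; n: row count.
fit_from_moments <- function(S, m, n) {
  x   <- seq_len(ncol(S))[-1]            # predictor columns (response is column 1)
  p   <- length(x)
  Sxx <- S[x, x, drop = FALSE]
  Sxy <- S[x, 1]
  beta      <- solve(Sxx, Sxy)           # slopes
  intercept <- unname(m[1] - sum(m[x] * beta))
  rss    <- (n - 1) * (S[1, 1] - sum(Sxy * beta))  # residual sum of squares
  df     <- n - p - 1
  sigma2 <- rss / df                     # residual variance estimate
  se     <- sqrt(sigma2 * diag(solve(Sxx)) / (n - 1))  # std. errors of slopes
  tval   <- beta / se
  list(coef = c(intercept, beta), se = se, t = tval,
       p = 2 * pt(-abs(tval), df), rss = rss, df = df)
}

# Usage on simulated data; only cov(), colMeans() and n reach the fit.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, c("x1", "x2", "x3")))
Y <- drop(1 + X %*% c(2, -1, 0.5) + rnorm(n))
Z <- cbind(Y = Y, X)
fit <- fit_from_moments(cov(Z), colMeans(Z), n)
```

A sub-model on, say, the first two predictors needs no new pass over the
data: fit_from_moments(cov(Z)[1:3, 1:3], colMeans(Z)[1:3], n).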

2. There is a whole lot of tests designed to check the classical
assumptions of normality of the residuals, detect influential points, etc.
Obviously these cannot be carried out without the residuals.
So the lm object should provide a slot for the residuals, but whether the
residuals are in fact computed should not affect the functions mentioned in
the previous paragraph.
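
A rough R sketch of such an object follows. The constructor name moment_lm,
its keep_residuals flag and the assumption that the data arrive as a list of
matrix chunks (response in column 1) are all hypothetical. The
cross-products of [1,Y,X] are accumulated chunk by chunk, which carries the
same information as the covariance matrix, the means and the count; the
residuals slot stays NULL unless a second pass over the data is requested.

```r
# Hypothetical constructor: the fit uses only accumulated cross-products;
# residuals are optional and cost one extra pass over the chunks.
moment_lm <- function(chunks, keep_residuals = FALSE) {
  A <- 0
  n <- 0
  for (Z in chunks) {
    W <- cbind(1, Z)          # columns: intercept, Y, X
    A <- A + crossprod(W)     # running sum of W'W
    n <- n + nrow(Z)
  }
  x <- c(1, 3:ncol(A))        # intercept and predictor columns
  # Normal equations; less numerically stable than the QR route lm() takes.
  beta <- solve(A[x, x], A[x, 2])
  res <- NULL
  if (keep_residuals) {
    res <- unlist(lapply(chunks, function(Z) {
      Z[, 1] - drop(cbind(1, Z[, -1, drop = FALSE]) %*% beta)
    }))
  }
  list(crossprod = A, n = n, coefficients = beta, residuals = res)
}

# Usage: three chunks of 30 rows each.
set.seed(2)
Z <- cbind(Y = rnorm(90), x1 = rnorm(90), x2 = rnorm(90))
chunks <- list(Z[1:30, ], Z[31:60, ], Z[61:90, ])
a <- moment_lm(chunks)                         # a$residuals is NULL
b <- moment_lm(chunks, keep_residuals = TRUE)  # residuals filled in
```

Both calls return the same coefficients; only the second one carries the
residuals needed for diagnostics.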

I would appreciate any comments on this "design".

Thanks, Vadim
