[R] Linear models over large datasets

Charles C. Berry cberry at tajo.ucsd.edu
Fri Aug 17 20:41:28 CEST 2007


The original complaint

> > What exactly are my options to improve this? I looked into the biglm
> > package but the problem with it is it uses update() function and is
> > therefore not transparent

is not warranted.

As usual, the source code is the best reference. It took about a minute 
to download  biglm_0.4.tar.gz, open it in emacs and browse thru to see 
this reference:

 	ALGORITHM AS274  APPL. STATIST. (1992) VOL.41, NO. 2

in biglm/src/boundedQRf.f which appears to incremetnally update an 
orthogonal decomposition of the design matrix, etc.

This seems VERY transparent.

It would seem quite easy to borrow the Fortran code and the wrappers that 
biglm provide and adapt them to some other purpose.

Chuck


On Fri, 17 Aug 2007, Ravi Varadhan wrote:

> The simplest trick is to use the QR decomposition:
>
> The OLS solution (X'X)^{-1}X'y can be easily computed as:
> qr.solve(X, y)
>
> Here is an illustration:
>
>> set.seed(123)
>> X <- matrix(round(rnorm(100),1),20,5)
>> b <- c(1,1,2,2,3)
>> y <- X %*% b + rnorm(20)
>>
>> ans1 <- solve(t(X)%*%X,t(X)%*%y)
>> ans2 <- qr.solve(X,y)
>> all.equal(ans1,ans2)
> [1] TRUE
>
> Ravi.
> ----------------------------------------------------------------------------
> -------
>
> Ravi Varadhan, Ph.D.
>
> Assistant Professor, The Center on Aging and Health
>
> Division of Geriatric Medicine and Gerontology
>
> Johns Hopkins University
>
> Ph: (410) 502-2619
>
> Fax: (410) 614-9625
>
> Email: rvaradhan at jhmi.edu
>
> Webpage:  http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
>
>
>
> ----------------------------------------------------------------------------
> --------
>
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of dave fournier
> Sent: Friday, August 17, 2007 12:43 PM
> To: r-help at stat.math.ethz.ch
> Subject: [R] Linear models over large datasets
>
> >Its actually only a few lines of code to do this from first principles.
> >The coefficients depend only on the cross products X'X and X'y and you
> >can build them up easily by extending this example to read files or
> >a database holding x and y instead of getting them from the args.
> >Here we process incr rows of builtin matrix state.x77 at a time
> >building up the two cross productxts, xtx and xty, regressing
> >Income (variable 2) on the other variables:
>
> >mylm <- function(x, y, incr = 25) {
> >	start <- xtx <- xty <- 0
> >	while(start < nrow(x)) {
> >	    idx <- seq(start + 1, min(start + incr, nrow(x)))
> >	    x1 <- cbind(1, x[idx,])
> >	    xtx <- xtx + crossprod(x1)
> >	    xty <- xty + crossprod(x1, y[idx])
> >	    start <- start + incr
> >	}
> >	solve(xtx, xty)
> >}
>
> >mylm(state.x77[,-2], state.x77[,2])
>
>
> >On 8/16/07, Alp ATICI <alpatici at gmail.com> wrote:
> > I'd like to fit linear models on very large datasets. My data frames
> > are about 2000000 rows x 200 columns of doubles and I am using an 64
> > bit build of R. I've googled about this extensively and went over the
> > "R Data Import/Export" guide. My primary issue is although my data
> > represented in ascii form is 4Gb in size (therefore much smaller
> > considered in binary), R consumes about 12Gb of virtual memory.
> >
> > What exactly are my options to improve this? I looked into the biglm
> > package but the problem with it is it uses update() function and is
> > therefore not transparent (I am using a sophisticated script which is
> > hard to modify). I really liked the concept behind the  LM package
> > here: http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html
> > But it is no longer available. How could one fit linear models to very
> > large datasets without loading the entire set into memory but from a
> > file/database (possibly through a connection) using a relatively
> > simple modification of standard lm()? Alternatively how could one
> > improve the memory usage of R given a large dataset (by changing some
> > default parameters of R or even using on-the-fly compression)? I don't
> > mind much higher levels of CPU time required.
> >
> > Thank you in advance for your help.
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
> If your design matrix X is very well behaved this approach may work for
> you. Often just doing solve(X'X,y) will fail for numerical reasons. The
> right way to do it is tofactor the matrix X  as
>
>           X = A * B
>
> where B is 200x200 in your case and A is  2000000 x 200 with
>   A'*A = I  (That is A is orthogonal.)
>
> so  X'*X = B' *B  and you use
>
>     solve(B'*B,y);
>
> To find A and B you can use modified Gram-Schmidt which is very easy to
> program and works well when you wish to store the columns of X on a hard
> disk and just read in a bit at a time. Some people claim that modifed
> Gram-Schmidt is unstable but it has always worked well for me.
> In any event you can check  to ensure that A'*A = I and
>   X=A*B
>
>       Cheers,
>
>        Dave
>
> -- 
> David A. Fournier
> P.O. Box 2040,
> Sidney, B.C. V8l 3S3
> Canada
> Phone/FAX 250-655-3364
> http://otter-rsch.com
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



More information about the R-help mailing list