[R] lean and mean lm/glm?

Charles C. Berry cberry at tajo.ucsd.edu
Tue Aug 22 19:07:11 CEST 2006


On Mon, 21 Aug 2006, Damien Moore wrote:

>
>> For very large regression problems there is the biglm package (put your
>> data into a database, read in 500,000 rows at a time, and keep updating
>> the fit).
>
> thanks. I took a look at biglm and it seems pretty easy to use and,
> looking at the source, avoids much of the redundancy of lm. Correct me
> if I'm wrong, but I think it would be virtually impossible to extend to
> glm, because of the non-linearity in glm models.
>
> I might hack around at the source code for glm.fit -- I think I can 
> avoid some of the redundancy involved in that routine pretty easily, but 
> it will mean rewriting the summary output code...


Damien,

If you know what is 'under the hood' of glm, you can use the biglm
approach to perform a one-step update of the coefficients of a glm model.

There is plenty of theory for one-step estimators that use consistent 
estimates as starting values.
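
Here is a rough, untested sketch of what I have in mind for a probit
model, accumulating the weighted cross-products chunk by chunk rather
than calling biglm itself. The objects 'chunks' (a list of data frames
read in from the database) and the starting value 'beta0' are
hypothetical stand-ins for however you bring the data in:

one.step.probit <- function(chunks, formula, beta0) {
    fam <- binomial(link = "probit")
    XtWX <- 0
    XtWz <- 0
    for (dat in chunks) {
        X <- model.matrix(formula, dat)
        y <- model.response(model.frame(formula, dat))
        eta <- drop(X %*% beta0)
        mu <- fam$linkinv(eta)
        mu.eta <- fam$mu.eta(eta)
        w <- mu.eta^2 / fam$variance(mu)   # IRLS weights at beta0
        z <- eta + (y - mu) / mu.eta       # working response at beta0
        XtWX <- XtWX + crossprod(X * sqrt(w))
        XtWz <- XtWz + crossprod(X, w * z)
    }
    beta1 <- drop(solve(XtWX, XtWz))       # one Fisher scoring step
    list(coefficients = beta1, cov = solve(XtWX))
}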

You can probably get a good starting value by slicing the data set into
smaller pieces, running glm.fit on each piece, and averaging the resulting
coefficient estimates.
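
For instance (again untested, and assuming the slices are of roughly
comparable size), something along these lines, reusing the hypothetical
'chunks' list from above:

start.value <- function(chunks, formula) {
    coefs <- sapply(chunks, function(dat) {
        X <- model.matrix(formula, dat)
        y <- model.response(model.frame(formula, dat))
        glm.fit(X, y, family = binomial(link = "probit"))$coefficients
    })
    rowMeans(coefs)   # simple average of the slice-wise estimates
}

## e.g. (hypothetical variables): beta0 <- start.value(chunks, y ~ x1 + x2)
##                                fit1  <- one.step.probit(chunks, y ~ x1 + x2, beta0)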

Chuck

>
> cheers
> Damien
>
>
> --- On Mon 08/21, Greg Snow < Greg.Snow at intermountainmail.org > wrote:
> From: Greg Snow [mailto: Greg.Snow at intermountainmail.org]
> To: damien.moore at excite.com, r-help at stat.math.ethz.ch
> Date: Mon, 21 Aug 2006 12:01:06 -0600
> Subject: RE: [R] lean and mean lm/glm?
>
> For very large regression problems there is the biglm package (put your
> data into a database, read in 500,000 rows at a time, and keep updating
> the fit).
>
> This has not been extended to glm yet.
>
> Hope this helps,
>
>
> -- 
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at intermountainmail.org
> (801) 408-8111
>
>
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Damien Moore
> Sent: Monday, August 21, 2006 11:49 AM
> To: r-help at stat.math.ethz.ch
> Subject: [R] lean and mean lm/glm?
>
>
> Hi All: I'm new to R and have a few questions about getting R to run
> efficiently with large datasets.
>
> I'm running R on Windows XP with 1Gb ram (so about 600mb-700mb after the
> usual windows overhead). I have a dataset that has 4 million
> observations and about 20 variables. I want to run probit regressions on
> this data, but can't do this with more than about 500,000 observations
> before I start running out of ram (you could argue that I'm getting
> sufficient precision with <500,000 obs, but let's pretend otherwise).
> Loading 500,000 observations into a data frame only takes about 100Mb of
> ram, so that isn't the problem. Instead it seems R uses a huge amount of
> memory when running the glm methods. I called the Fortran routines that
> lm and glm use directly, but even they create a large number of
> extraneous variables in the output (e.g. the Xs, ys, residuals etc.) and
> during processing. For instance (sample code):
>
> x = runif(1000000)
> # I notice this step chews up a lot more than the 7mb of ram required to
> # store y during processing, but cleans up ok afterwards with a gc() call
> y = 3*x + rnorm(1000000)
> X = cbind(x)
> p = ncol(X)
> n = NROW(y)
> ny = NCOL(y)
> tol = 1e-7
> # this is the fortran routine called by lm - regressing y on X here
> z <- .Fortran("dqrls", qr = X, n = n, p = p, y = y, ny = ny,
>               tol = as.double(tol), coefficients = mat.or.vec(p, ny),
>               residuals = y, effects = y, rank = integer(1), pivot = 1:p,
>               qraux = double(p), work = double(2 * p), PACKAGE = "base")
>
> This code runs very quickly - suggesting that in principle R should have
> no problem at all handling very large data sets - but uses >100mb during
> processing, and z is about a 20mb object. Scaling this up to a much
> larger dataset with many variables, it's easy to see I'm going to run
> into problems.
>
> My questions:
> 1. Are there any memory efficient alternatives to lm/glm in R?
> 2. Is there any way to prevent the Fortran routine "dqrls" from
> producing so much output? (I suspect not, since its output has to be
> compatible with the summary method, which seems to rely on having a copy
> of all variables instead of just references to the relevant variables -
> correct me if I'm wrong on this.)
> 3. Failing 1 & 2, how easy would it be to create new versions of lm and
> glm that don't use so much memory? (Not that I'm volunteering or
> anything ;) ). There is no need to hold individual residuals in memory
> or make copies of the variables (at least for my purposes). How well
> documented is the source code?
>
> cheers
> Damien Moore
>

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717


