[R] lean and mean lm/glm?

Damien Moore damien.moore at excite.com
Wed Aug 23 17:06:43 CEST 2006


Thomas Lumley wrote:

> No, it is quite straightforward if you are willing to make multiple passes 
> through the data. It is hard with a single pass and may not be possible 
> unless the data are in random order.
> 
> Fisher scoring for glms is just an iterative weighted least squares 
> calculation using a set of 'working' weights and 'working' response. These 
> can be defined chunk by chunk and fed to biglm. Three iterations should 
> be sufficient.
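
For concreteness, the "working" quantities for, say, a logistic model would be computed per chunk roughly as below (illustrative R only; the function name is made up). Each chunk's z and w could then be fed to biglm as a weighted linear model:

# Illustrative sketch only: working response z and working weights w for one
# chunk of an IRLS step, given the current coefficient vector beta (logit link).
working_chunk <- function(Xi, yi, beta) {
    eta <- drop(Xi %*% beta)      # linear predictor for this chunk
    mu  <- plogis(eta)            # fitted probabilities
    w   <- mu * (1 - mu)          # working weights: (dmu/deta)^2 / V(mu)
    z   <- eta + (yi - mu) / w    # working response
    list(z = z, w = w)
}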

(NB: although I did not state it clearly, I was referring to a single pass when I wrote "impossible".) Doing as you suggest with multiple passes would entail either sticking the database input calls into the main iterative loop of a glm.fit lookalike, or saddling the user with a very unattractive sequence of calls along the lines of:

big_glm.init
iterate:
    load_data_chunk
    big_glm.newiter
    iterate:                      # could use a subset of the chunks on the first few go-rounds
        load_data_chunk
        update.big_glm
    big_glm.check_convergence     # would also need to do coefficient adjustments if convergence is failing

Because most (if not all) of my data can fit into memory anyway, I propose simply doing the calculations in a modified glm.fit in chunks (i.e. by subsetting the X and y data within the loops), with a user-defined chunk length. I can always add database input calls later to handle exceptionally large data sets.
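
Roughly along these lines (a toy sketch only, not the actual modified glm.fit: logistic link hard-coded, no offsets, prior weights, or any of glm.fit's safeguards):

# Toy sketch of chunked IRLS: X and y stay in memory, but the working
# quantities and crossproducts are only ever formed chunk.size rows at a time.
chunked_logit_fit <- function(X, y, chunk.size = 1e5, maxit = 25, tol = 1e-8) {
    p <- ncol(X)
    beta <- rep(0, p)
    starts <- seq(1, nrow(X), by = chunk.size)
    for (iter in seq_len(maxit)) {
        XtWX <- matrix(0, p, p)
        XtWz <- numeric(p)
        for (s in starts) {
            idx <- s:min(s + chunk.size - 1, nrow(X))
            Xi  <- X[idx, , drop = FALSE]
            eta <- drop(Xi %*% beta)
            mu  <- plogis(eta)
            w   <- mu * (1 - mu)                       # working weights
            z   <- eta + (y[idx] - mu) / w             # working response
            XtWX <- XtWX + crossprod(Xi, Xi * w)       # accumulate X'WX
            XtWz <- XtWz + drop(crossprod(Xi, w * z))  # accumulate X'Wz
        }
        beta.new <- drop(solve(XtWX, XtWz))
        if (max(abs(beta.new - beta)) < tol * (1 + max(abs(beta)))) {
            beta <- beta.new
            break
        }
        beta <- beta.new
    }
    beta
}

In the real thing the chunk loop would sit inside the existing glm.fit iteration and use the family object's linkinv, mu.eta and variance functions rather than hard-coding the logit.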

If one of you has a better suggestion I'm willing to hear it.

So far, I have stripped out a lot of the (in my view) extraneous stuff from glm and cut its memory usage by more than half. I can now fit a 12-variable, 1-million-observation data set using "only" 200Mb of working memory (excluding the memory required to store the data); previously glm.fit was using 500Mb to do the same. Convergence took 9 iterations either way. To reiterate: the inefficiency is in calculating the estimates, not in storing the data.


