[R] Linear models over large datasets

Daniel Lakeland dlakelan at street-artists.org
Fri Aug 17 20:16:22 CEST 2007


On Fri, Aug 17, 2007 at 01:53:25PM -0400, Ravi Varadhan wrote:
> The simplest trick is to use the QR decomposition:
> 
> The OLS solution (X'X)^{-1}X'y can be easily computed as:
> qr.solve(X, y)

While I agree that this is the correct way to solve the linear algebra
problem, I don't see how re-implementing the existing lm function
(which already uses a QR decomposition internally) solves the problem
that was mentioned, namely the massive amount of memory that the
process consumes.

2e6 rows by 200 columns by 8 bytes per double is about 3.2e9 bytes, so
roughly 3 gigs just to hold the design matrix. The QR decomposition, or
any other solving process, will at least double this to around 6 gigs,
and it would be unsurprising for overhead to push peak memory usage to
8 gigs.
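A quick sanity check of those figures in R, assuming the data are held
as a single dense double matrix:

n <- 2e6; p <- 200
n * p * 8 / 2^30        # ~3.0 gigs just to hold the design matrix
2 * n * p * 8 / 2^30    # ~6.0 gigs once the solver makes a working copy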

I'm going to assume that the original user has perhaps 1.5 gigs to 2
gigs available, so any process that even READS IN a matrix of more
than about 1 million rows will exceed the available memory. Hence, my
suggestion to randomly downsample the matrix by a factor of 10, and
then bootstrap the coefficients by repeating the downsampling process
20, 50, or 100 times to take advantage of all of the data available.
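To make that concrete, here is a minimal sketch of the
downsample-and-bootstrap idea (not code I have actually run);
read.rows() is a hypothetical helper that pulls only the requested rows
off disk, and the formula is a placeholder:

n.total <- 2e6
n.sub   <- n.total / 10      # keep 10% of the rows per pass
n.reps  <- 50                # number of downsampling passes
coefs <- replicate(n.reps, {
    idx <- sample(n.total, n.sub)
    d   <- read.rows("bigdata.txt", idx)  # hypothetical: read only rows idx
    coef(lm(y ~ ., data = d))
})
rowMeans(coefs)       # pooled coefficient estimates
apply(coefs, 1, sd)   # spread of the estimates across passes

Each pass then fits on about 2e5 rows, which easily fits in the memory
budget assumed above.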

Now that I'm aware of the biglm package, I think that it is probably
preferable.
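
For completeness, a rough sketch of the chunked approach with biglm;
the file name, chunk size, and formula are placeholders, and the
end-of-file handling is deliberately crude:

library(biglm)
f     <- y ~ x1 + x2                       # placeholder formula
con   <- file("bigdata.csv", open = "r")
chunk <- read.csv(con, nrows = 1e5)        # first chunk (consumes the header)
fit   <- biglm(f, data = chunk)
nms   <- names(chunk)
repeat {
    chunk <- try(read.csv(con, nrows = 1e5, header = FALSE,
                          col.names = nms), silent = TRUE)
    if (inherits(chunk, "try-error") || nrow(chunk) == 0) break
    fit <- update(fit, chunk)              # fold the new chunk into the fit
}
close(con)
summary(fit)

Only one chunk is ever held in memory at a time, which is what actually
addresses the memory problem.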

-- 
Daniel Lakeland
dlakelan at street-artists.org
http://www.street-artists.org/~dlakelan


