# [R] Linear models over large datasets

Gabor Grothendieck ggrothendieck at gmail.com
Fri Aug 17 00:28:42 CEST 2007

It's actually only a few lines of code to do this from first principles.
The coefficients depend only on the cross products X'X and X'y, and you
can build them up easily by extending this example to read files or
a database holding x and y instead of getting them from the args.
Here we process incr rows of the builtin matrix state.x77 at a time,
building up the two cross products, xtx and xty, and regressing
Income (variable 2) on the other variables:

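mylm <- function(x, y, incr = 25) {
  # running totals; 0 is replaced by a matrix on the first addition
  start <- xtx <- xty <- 0
  while(start < nrow(x)) {
    # row indices of the current chunk
    idx <- seq(start + 1, min(start + incr, nrow(x)))
    # prepend the intercept column; drop = FALSE keeps a one-row chunk a matrix
    x1 <- cbind(1, x[idx, , drop = FALSE])
    xtx <- xtx + crossprod(x1)          # accumulate X'X
    xty <- xty + crossprod(x1, y[idx])  # accumulate X'y
    start <- start + incr
  }
  # solve the normal equations (X'X) b = X'y
  solve(xtx, xty)
}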

mylm(state.x77[,-2], state.x77[,2])
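
As a rough sketch of the "read files" extension mentioned above, the same accumulation can be driven from an open file connection so that only incr rows are in memory at once. The function name chunked_lm, its arguments path and ycol, and the headerless comma-separated file layout are all assumptions made for illustration; they are not part of the original post. The result is placed next to an ordinary lm() fit for comparison.

```r
# Hypothetical helper (not from the original post): accumulate X'X and X'y
# while reading a headerless, comma-separated numeric file chunk by chunk.
# Assumes column `ycol` is the response and all other columns are predictors.
chunked_lm <- function(path, ycol, incr = 25) {
  con <- file(path, open = "r")
  on.exit(close(con))
  xtx <- xty <- 0
  repeat {
    lines <- readLines(con, n = incr)   # next chunk of at most incr rows
    if (length(lines) == 0) break       # end of file reached
    chunk <- as.matrix(read.csv(text = lines, header = FALSE))
    x1 <- cbind(1, chunk[, -ycol, drop = FALSE])  # prepend intercept column
    xtx <- xtx + crossprod(x1)                    # accumulate X'X
    xty <- xty + crossprod(x1, chunk[, ycol])     # accumulate X'y
  }
  solve(xtx, xty)
}

# Demo: write state.x77 to a temporary file, then fit from the file
tmp <- tempfile(fileext = ".csv")
write.table(state.x77, tmp, sep = ",", row.names = FALSE, col.names = FALSE)
b_chunked <- chunked_lm(tmp, ycol = 2, incr = 20)

# Reference fit with the full data in memory
b_lm <- coef(lm(state.x77[, 2] ~ state.x77[, -2]))

# Side-by-side comparison of the two sets of coefficients
print(cbind(chunked = as.vector(b_chunked), lm = unname(b_lm)))
unlink(tmp)
```

Solving the normal equations directly is less numerically stable than the QR decomposition that lm() uses, so for badly conditioned predictors the two fits may differ in more digits than they do here.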

On 8/16/07, Alp ATICI <alpatici at gmail.com> wrote:
> I'd like to fit linear models on very large datasets. My data frames
> are about 2000000 rows x 200 columns of doubles and I am using an 64
> "R Data Import/Export" guide. My primary issue is although my data
> represented in ascii form is 4Gb in size (therefore much smaller
> considered in binary), R consumes about 12Gb of virtual memory.
>
> What exactly are my options to improve this? I looked into the biglm
> package but the problem with it is it uses update() function and is
> therefore not transparent (I am using a sophisticated script which is
> hard to modify). I really liked the concept behind the  LM package
> here: http://www.econ.uiuc.edu/~roger/research/rq/RMySQL.html
> But it is no longer available. How could one fit linear models to very
> large datasets without loading the entire set into memory but from a
> file/database (possibly through a connection) using a relatively
> simple modification of standard lm()? Alternatively how could one
> improve the memory usage of R given a large dataset (by changing some
> default parameters of R or even using on-the-fly compression)? I don't
> mind much higher levels of CPU time required.
>