[R] biglm: how it handles large data set?

Mon Nov 1 14:10:28 CET 2010

----------------------------------------
> Date: Sun, 31 Oct 2010 00:22:12 -0700
> From: tim.liu at netzero.net
> To: r-help at r-project.org
> Subject: [R] biglm: how it handles large data set?
>
>
>
> I am trying to figure out why 'biglm' can handle large data sets...
>
> According to the R document - "biglm creates a linear model object that uses
> only p^2 memory for p variables. It can be updated with more data using
> update. This allows linear regression on data sets larger than memory."

I'm not sure anyone answered the question, but let me make some
comments, having done something similar with non-R code before, and
motivate my earlier comments about "streaming" data into a stats widget.

Probably this builds a matrix of moments (sums of powers and
cross-products of the data), which IIRC the stats books call
"computing formulas." Each new data point simply adds to the matrix
elements; it needn't be stored by itself. In the simple case of finding
an average, for example, each data point just adds to N and to a
running sum, and you divide the two when finished. So, up to the limits
of the floating-point implementation (at some point each new "y^n" term
is too small to add a non-zero delta to the current sum), you can keep
updating the matrix elements with very large data sets, and your memory
requirement is set by the number of matrix elements, not the number of
data points. Finally you invert the matrix to get your "answer." The
order you quote (p^2 memory for p variables) seems about right IIRC,
from when I tried to fit some image-related data to a polynomial.
You can probably just write the equations yourself, rearrange terms to
express them as sums over past data, and see that your coefficients
come from the matrix inverse.
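
To make the "running sums" idea concrete, here is a minimal sketch in
base R. It is not biglm's code: it just accumulates X'X and X'y chunk by
chunk and solves the normal equations at the end. The names (make_chunk,
XtX, Xty, beta_stream) and the simulated data are made up for
illustration, and the chunks are kept in a list only so the result can
be compared with lm() on the full data.

```r
# Sketch only: least squares from running sums, one chunk at a time.
# All names and the simulated data are hypothetical.
set.seed(1)
p   <- 3                    # columns of the design matrix (incl. intercept)
XtX <- matrix(0, p, p)      # running sum of X'X: p x p, independent of n
Xty <- numeric(p)           # running sum of X'y: length p

make_chunk <- function(n) {
  x1 <- rnorm(n)
  x2 <- rnorm(n)
  data.frame(x1 = x1, x2 = x2,
             y = 1 + 2 * x1 - 3 * x2 + rnorm(n, sd = 0.1))
}

# In real use each chunk would be read from disk and then discarded.
chunks <- lapply(1:5, function(i) make_chunk(200))

for (chunk in chunks) {
  X   <- cbind(1, chunk$x1, chunk$x2)
  XtX <- XtX + crossprod(X)                    # add this chunk's X'X
  Xty <- Xty + drop(crossprod(X, chunk$y))     # add this chunk's X'y
}

# Solve (X'X) beta = X'y; only p^2 + p numbers were ever kept.
beta_stream <- solve(XtX, Xty)

# Check against an ordinary fit on all the data at once.
all_data <- do.call(rbind, chunks)
beta_lm  <- unname(coef(lm(y ~ x1 + x2, data = all_data)))
print(rbind(beta_stream, beta_lm))
```

Forming X'X explicitly squares the condition number of the problem, so
this is fine for illustration but can lose accuracy on badly scaled or
nearly collinear data.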

>
> After reading the source code below, I still could not figure out how
> 'update' implements the algorithm...
>
> Thanks for any light shed upon this ...
>
> > biglm::biglm
>
> function (formula, data, weights = NULL, sandwich = FALSE)
> {
>     tt <- terms(formula)
>     if (!is.null(weights)) {
>         if (!inherits(weights, "formula"))
>             stop("`weights' must be a formula")
>         w <- model.frame(weights, data)[[1]]
>     }
>     else w <- NULL
>     mf <- model.frame(tt, data)
>     mm <- model.matrix(tt, mf)
>     qr <- bigqr.init(NCOL(mm))
>     qr <- update(qr, mm, model.response(mf), w)
>     rval <- list(call = sys.call(), qr = qr, assign = attr(mm,
>         "assign"), terms = tt, n = NROW(mm), names = colnames(mm),
>         weights = weights)
>     if (sandwich) {
>         p <- ncol(mm)
>         n <- nrow(mm)
>         xyqr <- bigqr.init(p * (p + 1))
>         xx <- matrix(nrow = n, ncol = p * (p + 1))
>         xx[, 1:p] <- mm * model.response(mf)
>         for (i in 1:p) xx[, p * i + (1:p)] <- mm * mm[, i]
>         xyqr <- update(xyqr, xx, rep(0, n), w * w)
>         rval$sandwich <- list(xy = xyqr)
>     }
>     rval$df.resid <- rval$n - length(qr$D)
>     class(rval) <- "biglm"
>     rval
> }
> 
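
Judging from the bigqr.init() and update(qr, ...) calls in the code you
quote, biglm seems to keep a QR-type triangular factor rather than raw
sums of products. I have not read the compiled code behind update, so
the following is only a sketch of the general idea in base R, not what
biglm actually does: stack the current small triangle on top of the new
rows and re-triangularize. The state stays (p+1) x (p+1) however many
rows go through it. update_state, state, and the simulated data are
hypothetical names, and the sketch assumes a full-rank design with no
near-collinear columns.

```r
# Sketch only: chunk-wise update of a triangular factor of [X | y].
# Not biglm's implementation; all names here are hypothetical.
set.seed(2)
n <- 600
p <- 3
X <- cbind(1, rnorm(n), rnorm(n))
y <- drop(X %*% c(1, 2, -3)) + rnorm(n, sd = 0.1)

# State: upper-triangular (p+1) x (p+1) matrix for the augmented data.
state <- matrix(0, p + 1, p + 1)

update_state <- function(state, X_chunk, y_chunk) {
  # Old triangle on top, new rows below, then take the R factor again.
  qr.R(qr(rbind(state, cbind(X_chunk, y_chunk))))
}

# Feed the data through in three chunks of 200 rows.
idx <- split(seq_len(n), rep(1:3, each = 200))
for (i in idx) {
  state <- update_state(state, X[i, , drop = FALSE], y[i])
}

R_mat   <- state[1:p, 1:p]        # triangular factor for X
z       <- state[1:p, p + 1]      # transformed response
beta_qr <- backsolve(R_mat, z)    # solves R beta = z
rss     <- state[p + 1, p + 1]^2  # residual sum of squares

# Check against a fit on all rows at once.
fit_all <- lm.fit(X, y)
print(rbind(beta_qr, coef(fit_all)))
```

Working with the triangular factor avoids forming X'X, which is usually
the numerically safer route, and the memory needed is still of order p^2.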