[R] Questions about biglm

John Fox jfox at mcmaster.ca
Thu Feb 19 18:02:23 CET 2009


Dear Greg and Dobo,

The vif() function in the car package computes VIFs (and generalized VIFs)
from the covariance matrix of the coefficients. I'm not sure whether it will
work directly on objects produced by biglm(), but if not, it should be easy
to adapt.
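
As a rough illustration of the covariance-matrix route, here is a minimal sketch in base R. The name vif_from_vcov is made up for this example and is not the car implementation; it handles only one-degree-of-freedom terms (no multi-level factors), and it assumes the fitted object offers a vcov() method that returns a matrix with named rows. It is demonstrated on an ordinary lm fit; whether the same call works unchanged on a biglm object is something to check rather than assume.

```r
# Sketch: VIFs from the coefficient covariance matrix (one-df terms only).
# vif_from_vcov is a hypothetical helper name, not part of car or biglm.
vif_from_vcov <- function(V) {
  keep <- rownames(V) != "(Intercept)"
  if (all(keep)) warning("No intercept term detected; VIFs may be misleading.")
  # Correlation matrix of the coefficient estimates, intercept dropped
  R <- cov2cor(V[keep, keep, drop = FALSE])
  # For one-df terms the VIFs are the diagonal of the inverse of R
  stats::setNames(diag(solve(R)), rownames(R))
}

# Simulated data in which x1 and x2 are correlated
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

fit  <- lm(y ~ x1 + x2 + x3)
vifs <- vif_from_vcov(vcov(fit))
print(vifs)
```

Because only the covariance matrix is needed, nothing here touches model.matrix(), which is what makes this route attractive when the full data are not in memory.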

I hope this helps,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Greg Snow
> Sent: February-19-09 11:35 AM
> To: dobomode; r-help at r-project.org
> Subject: Re: [R] Questions about biglm
> 
> The idea of the biglm function is to have only part of the data in memory at
> a time.  You read in part of the data and run biglm on that section, delete
> it from memory, then load the next part and use update to include the new
> data in the analysis; delete that, read in the next group, run update, and
> repeat until you have processed all the data.  The result will then be the
> same as if you had run lm on the entire dataset (apart from possible slight
> differences due to rounding).  The bigglm function, or code from other
> packages (SQLiteDF for one), can automate this a bit more.
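
The read-fit-update cycle described above might look roughly like the sketch below. The data are simulated and split in memory purely for illustration; in real use each chunk would be read from disk and discarded after the update. The variable names (dat, chunks, fit, full) are assumptions of this example. Note that each update() call adds a chunk to the same accumulated fit, which is different from fitting a separate regression to each chunk.

```r
library(biglm)

# Simulated stand-in for a data set that would normally arrive in pieces
set.seed(42)
n   <- 3000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 0.5 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Three equal chunks; in practice these would be read one at a time
chunks <- split(dat, rep(1:3, each = n / 3))

# Fit on the first chunk, then fold in the remaining chunks with update()
fit <- biglm(y ~ x1 + x2, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

# Compare with lm() on the full data
full <- lm(y ~ x1 + x2, data = dat)
print(all.equal(coef(fit), coef(full)))
```

If the model includes factors, the biglm help page says the factor levels must be the same in every chunk, so it is worth setting the levels explicitly when reading each piece.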
> 
> The code for VIF below uses the model.matrix command, which returns the x
> matrix for the analysis when used with an lm object.  Since biglm is based on
> the idea of not having all the data in memory at once, I would be very
> surprised if model.matrix worked with biglm objects, so that code is unlikely
> to work as is.
> 
> One approach is to do VIF and other diagnostics on a subset of the data
> (random sample, stratified random sample) that fits easily into memory, then,
> after making decisions about the model based on the diagnostics, run the
> final model with biglm to get the precise results using the full data set.
> You can do the diagnostics on a couple of different random subsets to confirm
> the decisions made.
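
A minimal sketch of the subset approach, using base R only. The helper vif_aux is a made-up name for this example; it computes the textbook VIF, 1 / (1 - R^2), from an auxiliary regression of each predictor on the others, and assumes numeric predictors. The data frame big is simulated here to stand in for data that would really be sampled from disk or a database, and the sample size of 5000 is arbitrary.

```r
# Textbook VIF for numeric predictors: 1 / (1 - R^2_j), where R^2_j comes
# from regressing predictor j on the remaining predictors.
# vif_aux is a hypothetical helper name used only in this sketch.
vif_aux <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Simulated stand-in for a large data set
set.seed(7)
n   <- 1e5
big <- data.frame(x1 = rnorm(n))
big$x2 <- big$x1 + rnorm(n)
big$x3 <- rnorm(n)

# Repeat the diagnostic on three independent random subsets
vif_by_sample <- replicate(3, {
  idx <- sample(nrow(big), 5000)
  vif_aux(big[idx, ], c("x1", "x2", "x3"))
})
print(round(vif_by_sample, 2))
```

Each column holds the VIFs from one random subset; if the columns roughly agree, the subset-based decision is probably safe to carry over to the full biglm fit.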
> 
> Hope this helps,
> 
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of dobomode
> > Sent: Wednesday, February 18, 2009 9:34 PM
> > To: r-help at r-project.org
> > Subject: [R] Questions about biglm
> >
> > Hello folks,
> >
> > I am very excited to have discovered R and have been exploring its
> > capabilities. R's regression models are of great interest to me as my
> > company is in the business of running thousands of linear regressions
> > on large datasets.
> >
> > I am using biglm to run linear regressions on datasets that are as
> > large as several GB. I have been pleasantly surprised that biglm
> > runs the regressions extremely fast (one regression may take minutes
> > in SPSS vs seconds in R).
> >
> > I have been trying to wrap my head around biglm and have a couple of
> > questions.
> >
> > 1. How can I get VIFs (Variance Inflation Factors) using biglm? I was
> > able to get VIFs from the regular lm function using this piece of
> > code I found through Google, but have not been able to adapt it to
> > work with biglm. Has anyone been successful in this?
> >
> > vif.lm <- function(object, ...) {
> >   # (X'X)^-1 and X'X for the fitted model
> >   V <- summary(object)$cov.unscaled
> >   Vi <- crossprod(model.matrix(object))
> >   nam <- names(coef(object))
> >   k <- match("(Intercept)", nam, nomatch = 0L)
> >   if (k > 0) {
> >     v1 <- diag(V)[-k]
> >     # centered sum of squares of each predictor
> >     v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
> >     nam <- nam[-k]
> >   } else {
> >     v1 <- diag(V)
> >     v2 <- diag(Vi)
> >     warning("No intercept term detected. Results may surprise.")
> >   }
> >   structure(v1 * v2, names = nam)
> > }
> >
> > 2. How reliable / stable is biglm's update() function? I was
> > experimenting with running regressions on individual chunks of my
> > large dataset, but the coefficients I got were different compared to
> > those obtained from running biglm on the whole dataset. Am I mistaken
> > when I say that update() is intended to run regressions in chunks
> > (when memory becomes an issue with datasets that are too large) and
> > produce identical results to running a single regression on the
> > dataset as a whole?
> >
> > Thanks!
> >
> > Dobo
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 