[R] Maximum number of variables allowed in a multiple linear regression model

Douglas Bates bates at stat.wisc.edu
Thu Feb 7 02:53:42 CET 2008


On Feb 6, 2008 11:28 AM, Tony Plate <tplate at acm.org> wrote:
> Bert Gunter wrote:
> > I strongly suggest you collaborate with a local statistician. I can think of
> > no circumstance where multiple regression on "hundreds of thousands of
> > variables" is anything more than a fancy random number generator.
>
> That sounds like a challenge!  What is the largest regression problem (in
> terms of numbers of variables) that people have encountered where it made
> sense to do some sort of linear regression (and gave useful results)?
> (Including multilevel and Bayesian techniques.)

I have fit linear and generalized linear models with hundreds of
thousands of coefficients but, of course, with a highly structured
model matrix and using sparse matrix techniques.  What is called the
Rasch model for analysis of item response data (e.g. correct/incorrect
responses by students to the items on a multiple-choice test) is a
generalized linear model with the students and the items as factors.
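
A minimal sketch, on simulated data, of why the model matrix for such a model is so sparse. The sizes, the variable names and the use of sparse.model.matrix() from the current Matrix package are assumptions chosen for illustration; this is not the code used for the fits mentioned above.

```r
library(Matrix)  # recommended package distributed with R

set.seed(1)
n_students <- 200   # hypothetical sizes, for illustration only
n_items    <- 20

# One row per (student, item) response
d <- expand.grid(student = factor(seq_len(n_students)),
                 item    = factor(seq_len(n_items)))

# Simulate correct/incorrect responses from student and item effects
ability  <- rnorm(n_students)
easiness <- rnorm(n_items)
eta <- ability[as.integer(d$student)] + easiness[as.integer(d$item)]
d$correct <- rbinom(nrow(d), 1, plogis(eta))

# Sparse model matrix with students and items as factors
X <- sparse.model.matrix(~ 0 + student + item, data = d)
dim(X)
nnzero(X) / prod(dim(X))   # fraction of nonzero entries

# The equivalent dense matrix, for a size comparison
X_dense <- model.matrix(~ 0 + student + item, data = d)
c(dense = object.size(X_dense), sparse = object.size(X))
```

Each row carries one indicator for the student and at most one for the item, so the fraction of nonzero entries shrinks as the number of students grows.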

However, like Bert I would be very dubious of any attempt to fit a
linear regression model to 3000 variables that were not generated in a
systematic way.  Sounds like a massive, computer-fueled fishing
expedition (a.k.a. "data mining").
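
As a rough illustration of the fishing-expedition concern (a sketch on simulated noise, with sizes picked arbitrarily), regressing a response on predictors that have nothing to do with it still turns up some "significant" coefficients:

```r
set.seed(42)
n <- 500   # observations (hypothetical size)
p <- 100   # predictors   (hypothetical size)

x <- matrix(rnorm(n * p), nrow = n)
y <- rnorm(n)   # response generated independently of every predictor

fit <- lm(y ~ x)

# p-values for the slopes (first row is the intercept)
pvals <- summary(fit)$coefficients[-1, 4]

# Number of predictors that look "significant" at the 5% level;
# about 5% of p would be expected by chance alone
sum(pvals < 0.05)
```

With 3000 such variables the same reasoning suggests on the order of 150 spurious findings at the 5% level.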


> However, the original poster did say "hundreds to thousands", which is
> smaller than "hundreds of thousands".  When I try a regression problem with
> 3,000 coefficients in R running under Windows XP 64 bit with 8Gb of memory
> on the machine and the /3Gb option active (i.e., R can get up to 3Gb), R
> 2.6.1 runs out of memory (apparently trying to duplicate the model matrix):
>
> R version 2.6.1 (2007-11-26)
> Copyright (C) 2007 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
>
>  > m <- 3000
>  > n <- m * 10
>  > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
> dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
>  > dim(x)
> [1] 30000  3000
>  > k <- sample(m, 10)
>  > y <- rowSums(x[,k]) + 10 * rnorm(n)
>  > fit <- lm.fit(y=y, x=x)
> Error: cannot allocate vector of size 686.6 Mb
>  > object.size(x)/2^20
> [1] 687.7787
>  > memory.size()
> [1] -2022.552
>  >
> and the Windows process monitor shows the peak memory usage for Rgui.exe at
> 2,137,923K.  But in a 64 bit version of R, I would be surprised if it was
> not possible to run this (given sufficient memory).
>
> However, R easily handles a slightly smaller problem:
>  > m <- 1000 # of variables
>  > n <- m * 10 # of rows
>  > k <- sample(m, 10)
>  > x <- matrix(rnorm(n*m), ncol=m, nrow=n,
> dimnames=list(paste("C",1:n,sep=""), paste("X",1:m,sep="")))
>  > y <- rowSums(x[,k]) + 10 * rnorm(n)
>  > fit <- lm.fit(y=y, x=x)
>  > # distribution of coefs that should be one vs zero
>  > round(rbind(one=quantile(fit$coefficients[k]),
> zero=quantile(fit$coefficients[-k])), digits=2)
>          0%   25%   50%  75% 100%
> one   0.94  0.98  1.04 1.10 1.18
> zero -0.30 -0.08 -0.01 0.06 0.29
>  >
>
> To echo Bert Gunter's cautions, one must be careful doing ordinary linear
> regression with large numbers of coefficients.  It does seem a little
> unlikely that there is sufficient data to get useful estimates of three
> thousand coefficients using linear regression in data managed in Excel
> (though I guess it could be possible using Excel 12.0, which can handle up
> to 1 million rows - recent versions prior to 2008 could handle only 64K rows
> - see http://en.wikipedia.org/wiki/Microsoft_Excel#Versions ).  So, the
> suggestion to consult a local statistician is good advice - there may be
> other more suitable approaches, and if some form of linear regression is an
> appropriate approach, there are things to do to gain confidence that the
> results of the linear regression convey useful information.
>
> -- Tony Plate
>
>
> >
> > -- Bert Gunter
> > Genentech Nonclinical Statistics
> >
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> > Behalf Of Michelle Chu
> > Sent: Tuesday, February 05, 2008 9:00 AM
> > To: R-help at r-project.org
> > Subject: [R] Maximum number of variables allowed in a multiple
> > linear regression model
> >
> > Hi,
> >
> > I would appreciate it if someone could confirm the maximum number of
> > variables allowed in a multiple linear regression model.  Currently, I am
> > looking for software capable of handling approximately 3,000 variables.  I
> > am using Excel to process the results.  Any information on processing a
> > matrix from Excel with hundreds to thousands of variables will be helpful.
> >
> > Best Regards,
> > Michelle
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>