[R] Large number of dummy variables

Tue Jul 22 16:07:14 CEST 2008

>>>>> "HaroldD" == Doran, Harold <HDoran at air.org>
>>>>>     on Mon, 21 Jul 2008 19:15:37 -0400 writes:

    HaroldD> Well, yes and no. In R there really isn't a need to create the model matrix because this is done in R from the factors. But, to implement this computational trick Alan is asking about, it requires that he first create the full, dense model matrix and then do the time-demeaning on that matrix.

    HaroldD> If lm() could go straight from a factor to a sparse
    HaroldD> model matrix, time-demeaning would not be necessary.

Well, lm() is in "stats" and would only work with dense matrices
anyway.
But you are right in what you *meant*:  
We'd need versions of  model.frame() and model.matrix()  which
from a formula produce a sparse model matrix (aka "X matrix") or
its transpose.
Doug Bates showed you how to do the latter manually,
equivalently to  model.matrix(~ 0 + f1 + f2) when f1 and f2 are
factors.
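
A minimal sketch of that manual construction, on toy factors rather
than Alan's data. It assumes the 'Matrix' package is installed; the
helper name indicator() and the toy f1 / f2 are made up for
illustration, and this is not necessarily the code from Doug's post.
Since model.matrix(~ 0 + f1 + f2) keeps every level of f1 but drops the
first level of f2 under the default treatment contrasts, the sketch
does the same:

```r
library(Matrix)

# Toy factors (hypothetical data, small enough to check against the dense result)
n  <- 24
f1 <- factor(rep(letters[1:4], length.out = n))
f2 <- factor(rep(LETTERS[1:3], length.out = n))

# Sparse n x nlevels(f) indicator matrix: one nonzero entry per observation
indicator <- function(f) {
  sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
               dims = c(length(f), nlevels(f)))
}

# Keep all levels of f1, drop the first level of f2 (treatment contrasts)
X_sparse <- cbind(indicator(f1), indicator(f2)[, -1, drop = FALSE])

# Dense reference, only feasible here because the example is tiny
X_dense <- model.matrix(~ 0 + f1 + f2)
```

The sparse version stores only one nonzero per factor per row, which
is what makes it attractive when the factors have thousands of levels.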

I'm sure that longer-term we'd want versions of model.matrix()
/ model.frame() that work with sparse matrices.

    HaroldD> Doing work as Doug suggests in the other
    HaroldD> post is what would be best for now, methinks.

Yes.
BTW, you mentioned  SparseM's  "OLS with sparse matrices".
The problem there is the same as with 'Matrix': You must somehow
get your sparse X matrix and the best current tools to do that, AFAIK,
are the ones in 'Matrix' Doug Bates mentioned (and wrote!).
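
Once a sparse X is available, one possible way to get the OLS
coefficients is to solve the normal equations X'X b = X'y directly.
The following is only a sketch on toy data, assuming the 'Matrix'
package is installed; the names indicator(), X, y and beta are made up
for illustration, and this does not describe SparseM's interface or
what lm() does internally:

```r
library(Matrix)

# Toy data (hypothetical): two factors and a noisy response
set.seed(42)
n  <- 24
f1 <- factor(rep(letters[1:4], length.out = n))
f2 <- factor(rep(LETTERS[1:3], length.out = n))
y  <- as.integer(f1) + 0.5 * as.integer(f2) + rnorm(n, sd = 0.1)

# Sparse n x nlevels(f) indicator matrix for one factor
indicator <- function(f) {
  sparseMatrix(i = seq_along(f), j = as.integer(f), x = 1,
               dims = c(length(f), nlevels(f)))
}

# Same columns as model.matrix(~ 0 + f1 + f2): all of f1, f2 minus its first level
X <- cbind(indicator(f1), indicator(f2)[, -1, drop = FALSE])

# Solve the normal equations; crossprod(X) stays sparse and symmetric
beta <- solve(crossprod(X), crossprod(X, y))
beta <- drop(as.matrix(beta))

# Dense reference fit, only for checking on this small example
beta_lm <- unname(coef(lm(y ~ 0 + f1 + f2)))
```

Solving the normal equations is less numerically robust than the QR
decomposition used by lm(), so treat this as a starting point only.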

Martin Maechler

    HaroldD> -----Original Message-----
    HaroldD> From: Bert Gunter [mailto:gunter.berton at gene.com]
    HaroldD> Sent: Mon 7/21/2008 6:45 PM
    HaroldD> To: Doran, Harold; aspearot at ucsc.edu; r-help at r-project.org
    HaroldD> Subject: RE: [R] Large number of dummy variables

    HaroldD> Unless I'm way off base, dummy variables are never needed (nor are they desirable)
    HaroldD> in R; they should be modelled as factors instead. AN INTRO TO R might, and
    HaroldD> certainly V&R's MASS and others will, explain this in more detail.

    HaroldD> -- Bert Gunter
    HaroldD> Genentech, Inc.

    HaroldD> -----Original Message-----
    HaroldD> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
    HaroldD> Behalf Of Doran, Harold
    HaroldD> Sent: Monday, July 21, 2008 3:16 PM
    HaroldD> To: aspearot at ucsc.edu; r-help at r-project.org
    HaroldD> Cc: Douglas Bates
    HaroldD> Subject: Re: [R] Large number of dummy variables

    HaroldD> Well, at the risk of entering a debate I really don't have time for (I'm
    HaroldD> doing it anyway) why not consider a random coefficient model? If your
    HaroldD> response has anything like, "well, random effects and fixed effects are
    HaroldD> correlated and so the estimates are biased but OLS is consistent and
    HaroldD> unbiased via an appeal to Gauss-Markov" then I will probably make time
    HaroldD> for this discussion :)

    HaroldD> I have experienced this problem, though. In what you're doing, you are
    HaroldD> first creating the model matrix and then doing the demeaning, correct? I
    HaroldD> do recall Doug Bates was, at one point, doing some work where the model
    HaroldD> matrix for the fixed effects was immediately created as a sparse matrix
    HaroldD> for OLS models. I think doing the work on the sparse matrix is a better
    HaroldD> analytical method than time-demeaning. I don't remember where that work
    HaroldD> is, though. 

    HaroldD> There is a package called SparseM which had functions for doing OLS with
    HaroldD> sparse matrices. I don't know its status, but vaguely recall the author
    HaroldD> of SparseM at one point noting that the work of Bates and Maechler would
    HaroldD> be the go-to package for work with large, sparse model matrices.

    >> -----Original Message-----
    >> From: r-help-bounces at r-project.org 
    >> [mailto:r-help-bounces at r-project.org] On Behalf Of Alan Spearot
    >> Sent: Monday, July 21, 2008 5:59 PM
    >> To: r-help at r-project.org
    >> Subject: [R] Large number of dummy variables
    >> 
    >> Hello,
    >> 
    >> I'm trying to run a regression predicting trade flows between 
    >> importers and exporters.  I wish to include both 
    >> year-importer dummies and year-exporter dummies.  The former 
    >> includes 1378 levels, and the latter includes 1390 levels.  I 
    >> have roughly 100,000 total observations.
    >> 
    >> When I'm using lm() to run a simple regression, it gives me a 
    >> "cannot allocate ___" error.  I've been able to get around this by 
    >> time-demeaning over one large group, but since I have two, it 
    >> doesn't work in the correct way.  Is there a more efficient 
    >> way to handle a model matrix this large in R?
    >> 
    >> Thanks for your help.
    >> 
    >> Alan Spearot
    >> 
    >> --
    >> Alan Spearot
    >> Assistant Professor - International Economics University of 
    >> California - Santa Cruz
    >> 1156 High Street
    >> 453 Engineering 2
    >> Santa Cruz, CA 95064
    >> Office:  (831) 459-1530
    >> acspearot at gmail.com
    >> http://people.ucsc.edu/~aspearot
    >>