[R] Large number of dummy variables
Martin Maechler
maechler at stat.math.ethz.ch
Tue Jul 22 16:07:14 CEST 2008
>>>>> "HaroldD" == Doran, Harold <HDoran at air.org>
>>>>> on Mon, 21 Jul 2008 19:15:37 -0400 writes:
HaroldD> Well, yes and no. In R there really isn't a need to
HaroldD> create the model matrix because this is done in R from
HaroldD> the factors. But to implement the computational trick
HaroldD> Alan is asking about, he must first create the full,
HaroldD> dense model matrix and then do the time-demeaning on
HaroldD> that matrix.
HaroldD> If lm() could go straight from a factor to a sparse
HaroldD> model matrix, time-demeaning would not be necessary.
Well, lm() is in "stats" and would only work with dense matrices
anyway.
But you are right in what you *meant*:
We'd need versions of model.frame() and model.matrix() which
from a formula produce a sparse model matrix (aka "X matrix") or
its transpose.
Doug Bates showed you how to do the latter manually,
equivalently to model.matrix(~ 0 + f1 + f2) when f1 and f2 are
factors.
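
As a minimal sketch of that manual construction (not necessarily the
exact code from Doug's post): assuming a version of 'Matrix' that
provides fac2sparse(), each factor yields a transposed indicator matrix,
and stacking them gives the transpose of X. The toy factors f1 and f2
below are made up for illustration. Note that model.matrix(~ 0 + f1 + f2)
keeps every level of f1 but drops the first level of f2, so the sketch
drops that row too in order to match it.

```r
library(Matrix)

# Toy factors, invented for illustration only
f1 <- factor(rep(letters[1:4], times = 5))         # 4 levels, 20 obs
f2 <- factor(rep(LETTERS[1:3], length.out = 20))   # 3 levels, 20 obs

# fac2sparse() returns the *transposed* indicator matrix:
# one row per level, one column per observation.
tX1 <- fac2sparse(f1)
tX2 <- fac2sparse(f2)

# Stack the pieces; drop the first level of f2 to mirror the
# contrasts that model.matrix(~ 0 + f1 + f2) applies.
Xt <- rbind(tX1, tX2[-1, , drop = FALSE])
X  <- t(Xt)                                        # sparse X, 20 x 6

# Dense reference built the usual way
X_dense <- model.matrix(~ 0 + f1 + f2)

# Same entries, ignoring dimnames and other attributes
same <- isTRUE(all.equal(as.matrix(X), X_dense, check.attributes = FALSE))
print(same)
```

With thousands of levels the dense X_dense is exactly what you cannot
allocate; the comparison is only there to check the small case.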
I'm sure that longer-term we'd want versions of model.matrix()
/ model.frame() that work with sparse matrices.
HaroldD> Doing work as Doug suggests in the other
HaroldD> post is what would be best for now, me thinks.
Yes.
BTW, you mentioned SparseM's "OLS with sparse matrices".
The problem there is the same as with 'Matrix': You must somehow
get your sparse X matrix, and the best current tools to do that, AFAIK,
are the ones in 'Matrix' Doug Bates mentioned (and wrote!).
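
Once a sparse X is available, one possible way to get the OLS
coefficients is to solve the normal equations with the sparse solvers in
'Matrix'. This is only a sketch under stated assumptions: the data,
the variable names (imp, exp, x, y) and the choice of normal equations
are mine for illustration, it assumes fac2sparse() is available, and it
is not what lm() does internally (lm() uses a dense QR decomposition).

```r
library(Matrix)

set.seed(42)

# Hypothetical toy data: fully crossed importer/exporter groups,
# two observations per cell, so the design has full column rank.
grid <- expand.grid(imp = sprintf("imp%02d", 1:10),
                    exp = sprintf("exp%02d", 1:8))
d  <- grid[rep(seq_len(nrow(grid)), 2), ]
f1 <- factor(d$imp)
f2 <- factor(d$exp)
x  <- rnorm(nrow(d))
y  <- 1.5 * x + rnorm(nrow(d))

# Sparse design: covariate, all levels of f1, all but the first of f2
X <- cbind(Matrix(x, ncol = 1, sparse = TRUE),
           t(fac2sparse(f1)),
           t(fac2sparse(f2)[-1, , drop = FALSE]))

# Normal equations: (X'X) beta = X'y, solved with sparse methods
XtX  <- crossprod(X)
Xty  <- crossprod(X, y)
beta <- as.vector(as.matrix(solve(XtX, Xty)))

# Check against lm() on this small problem
fit <- lm(y ~ 0 + x + f1 + f2)
agree <- isTRUE(all.equal(beta, unname(coef(fit)), tolerance = 1e-6))
print(agree)
```

Forming X'X can lose accuracy when X is badly conditioned, so treat the
result with care on real data; the point here is only that X never has
to exist as a dense matrix.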
Martin Maechler
HaroldD> -----Original Message-----
HaroldD> From: Bert Gunter [mailto:gunter.berton at gene.com]
HaroldD> Sent: Mon 7/21/2008 6:45 PM
HaroldD> To: Doran, Harold; aspearot at ucsc.edu; r-help at r-project.org
HaroldD> Subject: RE: [R] Large number of dummy variables
HaroldD> Unless I'm way off base, dummy variables are never needed (nor are they desirable)
HaroldD> in R; they should be modelled as factors instead. AN INTRO TO R might, and
HaroldD> certainly V&R's MASS and others will, explain this in more detail.
HaroldD> -- Bert Gunter
HaroldD> Genentech, Inc.
HaroldD> -----Original Message-----
HaroldD> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
HaroldD> Behalf Of Doran, Harold
HaroldD> Sent: Monday, July 21, 2008 3:16 PM
HaroldD> To: aspearot at ucsc.edu; r-help at r-project.org
HaroldD> Cc: Douglas Bates
HaroldD> Subject: Re: [R] Large number of dummy variables
HaroldD> Well, at the risk of entering a debate I really don't have time for (I'm
HaroldD> doing it anyway) why not consider a random coefficient model? If your
HaroldD> response has anything like, "well, random effects and fixed effects are
HaroldD> correlated and so the estimates are biased but OLS is consistent and
HaroldD> unbiased via an appeal to Gauss-Markov" then I will probably make time
HaroldD> for this discussion :)
HaroldD> I have experienced this problem, though. In what you're doing, you are
HaroldD> first creating the model matrix and then doing the demeaning, correct? I
HaroldD> do recall Doug Bates was, at one point, doing some work where the model
HaroldD> matrix for the fixed effects was immediately created as a sparse matrix
HaroldD> for OLS models. I think doing the work on the sparse matrix is a better
HaroldD> analytical method than time-demeaning. I don't remember where that work
HaroldD> is, though.
HaroldD> There is a package called SparseM which had functions for doing OLS with
HaroldD> sparse matrices. I don't know its status, but vaguely recall the author
HaroldD> of SparseM at one point noting that the work of Bates and Maechler would
HaroldD> be the go-to package for work with large, sparse model matrices.
>> -----Original Message-----
>> From: r-help-bounces at r-project.org
>> [mailto:r-help-bounces at r-project.org] On Behalf Of Alan Spearot
>> Sent: Monday, July 21, 2008 5:59 PM
>> To: r-help at r-project.org
>> Subject: [R] Large number of dummy variables
>>
>> Hello,
>>
>> I'm trying to run a regression predicting trade flows between
>> importers and exporters. I wish to include both
>> year-importer dummies and year-exporter dummies. The former
>> includes 1378 levels, and the latter includes 1390 levels. I
>> have roughly 100,000 total observations.
>>
>> When I'm using lm() to run a simple regression, it gives me a
>> "cannot allocate ___" error. I've been able to get around this by
>> time-demeaning over one large group, but since I have two, it
>> doesn't work in the correct way. Is there a more efficient
>> way to handle a model matrix this large in R?
>>
>> Thanks for your help.
>>
>> Alan Spearot
>>
>> --
>> Alan Spearot
>> Assistant Professor - International Economics University of
>> California - Santa Cruz
>> 1156 High Street
>> 453 Engineering 2
>> Santa Cruz, CA 95064
>> Office: (831) 459-1530
>> acspearot at gmail.com
>> http://people.ucsc.edu/~aspearot
>>