[R] glm with large datasets
Julio Gonzalez Diaz
oiluj1 at gmail.com
Thu Feb 26 19:14:39 CET 2009
I have to run a logit regression on a large dataset and I am not sure
about the best way to do it. The dataset is about 200000x2000 and R
runs out of memory when creating it.
After going over help archives and the mailing lists, I think there are
two main options, though I am not sure about which one will be better.
Of course, any alternative will be welcome as well.
Actually, I am not quite sure about whether any of these options will
work but, before getting into it, I would like to get some advice.
-A first option is to use the package ff, which allows working with the
dataset without loading it into RAM. This, combined with the bigglm
function, should do the job.
-The dataset contains a lot of sparse variables, so I was wondering
whether creating the model matrix as a sparse matrix might deliver good
results. In this case, I am not sure about the capabilities of glm or
some extension of it to deal with sparse matrices (I could not find any
documentation about this). If possible, this second option seems more
efficient since R might be capable of using the fact that matrices are
sparse to speed up the computations.
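
To make the two options concrete, here is a minimal sketch on simulated
data. It is only an illustration of the ideas, not a tested solution for
the real dataset, and it does not call ff or bigglm. It hand-rolls
iteratively reweighted least squares (IRLS) so that the same routine can
be fed either row chunks (the idea behind the first option) or a sparse
model matrix built with sparse.model.matrix from the Matrix package (the
second option). The names irls_pieces, fit_logit, get_chunk and
get_sparse, the chunk size of 500 and the simulated variables x1 and g
are all assumptions made up for this example.

```r
library(Matrix)  # recommended package shipped with R; provides sparse matrices

# Cross-products needed for one IRLS step, computed on one block of rows.
# Works for a base matrix or a sparse Matrix X.
irls_pieces <- function(X, y, beta) {
  eta <- as.numeric(X %*% beta)      # linear predictor
  mu  <- plogis(eta)                 # fitted probabilities
  w   <- mu * (1 - mu)               # IRLS weights for the logit link
  z   <- eta + (y - mu) / w          # working response
  list(XtWX = crossprod(X, w * X),   # w * X scales row i of X by w[i]
       XtWz = crossprod(X, w * z))
}

# Logit fit that only ever sees one block at a time.
# get_block(i) must return list(X = <model matrix rows>, y = <0/1 response>).
fit_logit <- function(get_block, n_blocks, tol = 1e-8, max_iter = 25) {
  beta <- numeric(ncol(get_block(1)$X))
  for (iter in seq_len(max_iter)) {
    XtWX <- NULL
    XtWz <- NULL
    for (i in seq_len(n_blocks)) {
      pc   <- irls_pieces(get_block(i)$X, get_block(i)$y, beta)
      XtWX <- if (is.null(XtWX)) pc$XtWX else XtWX + pc$XtWX
      XtWz <- if (is.null(XtWz)) pc$XtWz else XtWz + pc$XtWz
    }
    beta_new  <- as.numeric(solve(XtWX, XtWz))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

# Simulated stand-in for the real data: one numeric column and one factor
# whose dummy columns are mostly zero.
set.seed(1)
n <- 2000
d <- data.frame(x1 = rnorm(n),
                g  = factor(sample(letters[1:20], n, replace = TRUE)))
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x1))

# Option 1 (idea only): dense model matrix built 500 rows at a time.
rows_of   <- split(seq_len(n), ceiling(seq_len(n) / 500))
get_chunk <- function(i) {
  r <- rows_of[[i]]
  list(X = model.matrix(~ x1 + g, d[r, ]), y = d$y[r])
}
beta_chunked <- fit_logit(get_chunk, length(rows_of))

# Option 2: the whole model matrix, stored sparsely, as a single block.
Xs         <- sparse.model.matrix(~ x1 + g, d)
get_sparse <- function(i) list(X = Xs, y = d$y)
beta_sparse <- fit_logit(get_sparse, 1)

# Reference fit with plain glm, feasible here because the example is small.
beta_glm <- unname(coef(glm(y ~ x1 + g, family = binomial, data = d)))
print(max(abs(beta_chunked - beta_glm)))
print(max(abs(beta_sparse - beta_glm)))
```

In this sketch get_chunk slices a data frame that is already in memory;
with the real data it would presumably read the rows from disk instead
(for example from an ff object), which is the part that actually saves
RAM. The two options are also not exclusive: each chunk could itself be
built as a sparse matrix.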
Thanks in advance.
All the best!