[R] glm with large datasets

Julio Gonzalez Diaz oiluj1 at gmail.com
Thu Feb 26 19:14:39 CET 2009

Hi all,

I have to run a logit regression on a large dataset and I am not sure
about the best way to do it. The dataset is about 200000x2000 and R
runs out of memory when creating it.
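
For a sense of scale, here is a back-of-the-envelope estimate of the
memory a dense copy would need. It assumes every cell is stored as a
double (8 bytes), which is an assumption and not something checked
against the actual data; the variable names are only illustrative.

```r
# Rough size of a dense 200000 x 2000 numeric matrix.
n_rows <- 200000
n_cols <- 2000
bytes_per_double <- 8   # assumption: all columns stored as doubles

total_bytes <- n_rows * n_cols * bytes_per_double
total_gib <- total_bytes / 2^30

cat(sprintf("%.2f GiB\n", total_gib))   # prints 2.98 GiB
```

This counts a single copy only; building a model matrix and fitting
usually needs several working copies on top of it.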

After going over the help archives and the mailing lists, I think there
are two main options, though I am not sure which one would be better.
Of course, any alternative would be welcome as well.

Actually, I am not quite sure whether either of these options will
work but, before getting into them, I would like to get some advice.

-A first option is to use the package ff, which allows working with the
dataset without loading it into RAM. This, combined with the bigglm
function, should do the job.

-A second option: the dataset contains a lot of sparse variables, so I
was wondering whether creating the model matrix as a sparse matrix might
deliver good results. In this case, I am not sure whether glm or some
extension of it can deal with sparse matrices (I could not find any
documentation about this). If it is possible, this second option seems
more efficient, since R might be able to exploit the sparsity of the
matrices to speed up the computations.
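
To make the first option concrete, here is a minimal sketch of a
chunked logit fit with bigglm from the biglm package. The toy data
frame, its column names (x1, x2, y) and the chunk size are made up for
illustration. The data sit in an ordinary in-memory data frame here, so
the ff part is not shown; the sketch only illustrates that bigglm
processes the rows chunk by chunk.

```r
library(biglm)

set.seed(1)
n <- 5000

# Hypothetical toy data: one continuous and one mostly-zero predictor.
toy <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.05))
toy$y <- rbinom(n, 1, plogis(-1 + 0.5 * toy$x1 + 1.0 * toy$x2))

# bigglm runs through the data in chunks of `chunksize` rows, so the
# full model matrix is never formed at once.
fit <- bigglm(y ~ x1 + x2, data = toy, family = binomial(),
              chunksize = 1000, maxit = 20)

print(coef(fit))
```

On data this small the coefficients should agree closely with those of
an ordinary glm fit, which is an easy sanity check before moving to the
full dataset.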
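
For the second option, here is a rough sketch of what a logit fit on a
sparse model matrix could look like. It does not use glm: it builds the
design with sparse.model.matrix from the Matrix package and runs a
hand-written Newton-Raphson (IRLS) loop with sparse products. The toy
data, the column names (g, x, y) and the stopping rule are assumptions
made for illustration, and the loop has no safeguards such as step
halving or checks for separation.

```r
library(Matrix)

set.seed(42)
n <- 2000

# Hypothetical toy data: a factor (dummy-coded, mostly zeros) plus one
# continuous predictor.
d <- data.frame(
  g = factor(sample(letters[1:10], n, replace = TRUE)),
  x = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x + 0.3 * (d$g == "b")))

# Sparse design matrix; only the non-zero entries are stored.
X <- sparse.model.matrix(~ g + x, data = d)

# Newton-Raphson for the logit model, using sparse products throughout.
beta <- rep(0, ncol(X))
for (iter in 1:50) {
  mu <- plogis(as.numeric(X %*% beta))         # fitted probabilities
  w <- mu * (1 - mu)                           # IRLS weights
  XtWX <- crossprod(X, Diagonal(x = w) %*% X)  # X' W X, kept sparse
  score <- as.numeric(crossprod(X, d$y - mu))  # X' (y - mu)
  step <- as.numeric(solve(XtWX, score))
  beta <- beta + step
  if (max(abs(step)) < 1e-10) break
}
names(beta) <- colnames(X)

print(round(beta, 3))
```

The memory saving comes from X and X' W X never being expanded to dense
form; whether that is enough for a 200000x2000 problem depends on how
sparse the real variables are.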

Thanks in advance.
All the best!
