[R] Linear models over large datasets
alpatici at gmail.com
Thu Aug 16 22:24:08 CEST 2007
I'd like to fit linear models on very large datasets. My data frames
are about 2,000,000 rows x 200 columns of doubles, and I am using a
64-bit build of R. I've googled this extensively and gone over the
"R Data Import/Export" guide. My primary issue is that although my
data is about 4 GB in ASCII form (and therefore smaller still in
binary), R consumes about 12 GB of virtual memory.
What exactly are my options for improving this? I looked into the
biglm package, but the problem with it is that it builds the fit
through repeated calls to update(), and is therefore not transparent
(I am using a sophisticated script which is hard to modify). I really
liked the concept behind the LM package, but it is no longer
available.
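
To be concrete, the chunked update() pattern I would have to retrofit
looks roughly like this, as far as I understand the biglm
documentation (a sketch only; the file name, formula, column names,
and chunk size are placeholders):

  library(biglm)

  chunk.rows <- 100000                       # placeholder chunk size
  con <- file("bigdata.txt", open = "r")     # placeholder file name

  ## first chunk initialises the fit (and reads the header once)
  chunk <- read.table(con, header = TRUE, nrows = chunk.rows)
  cols  <- names(chunk)
  fit   <- biglm(y ~ x1 + x2, data = chunk)  # placeholder formula

  ## remaining chunks are folded in with update(); read.table() on an
  ## open connection resumes where the previous read stopped
  repeat {
    chunk <- tryCatch(read.table(con, header = FALSE, nrows = chunk.rows,
                                 col.names = cols),
                      error = function(e) NULL)  # error = end of input
    if (is.null(chunk)) break
    fit <- update(fit, chunk)
  }
  close(con)
  summary(fit)

Weaving this loop into every place my script calls lm() is exactly
the kind of modification I was hoping to avoid.
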
How could one fit linear models to very large datasets without
loading the entire set into memory, reading instead from a file or
database (possibly through a connection), with a relatively simple
modification of the standard lm()? Alternatively, how could one
improve R's memory usage on a large dataset (by changing some default
parameters of R, or even by using on-the-fly compression)? I don't
mind much higher CPU time.
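
For what it's worth, my understanding is that the fit itself only
needs the running cross-products t(X) %*% X and t(X) %*% y, which
could be accumulated chunk by chunk and solved at the end. A rough
sketch of what I mean (placeholder file, column names, and chunk
size again):

  con <- file("bigdata.txt", open = "r")      # placeholder file name
  XtX <- 0; Xty <- 0; first <- TRUE
  repeat {
    chunk <- if (first)
      read.table(con, header = TRUE, nrows = 100000)
    else
      tryCatch(read.table(con, header = FALSE, nrows = 100000,
                          col.names = cols),
               error = function(e) NULL)      # error = end of input
    if (is.null(chunk)) break
    if (first) { cols <- names(chunk); first <- FALSE }
    X   <- cbind(1, chunk$x1, chunk$x2)       # placeholder predictors
    XtX <- XtX + crossprod(X)                 # running t(X) %*% X
    Xty <- Xty + crossprod(X, chunk$y)        # running t(X) %*% y
  }
  close(con)
  beta <- solve(XtX, Xty)                     # least-squares coefficients

Is there an existing package that does this behind a plain lm()-like
interface?
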
Thank you in advance for your help.