[R] Reading large datasets and fitting logistic models in R
Prof Brian Ripley
ripley at stats.ox.ac.uk
Sun Aug 10 08:18:43 CEST 2008
See also bigglm() in package biglm.
On Sat, 9 Aug 2008, Pradheep K E wrote:
> Hi R-experts,
>
> Does anyone have experience using R for handling large scale data (millions
> of rows, hundreds or thousands of features)?
>
> What is the largest size of data that anyone has used with glm?
I've used 700,000 rows and about 100 columns, but that was 4 years ago and
we have more memory now. It matters whether the 'features' are numeric or
categorical, as the latter can expand to many columns in the model matrix.
As a rough guide, expect glm() to need about 200 bytes of memory per data
element, i.e. roughly 200 x nrows x ncols bytes. Calling glm.fit() directly
will be more efficient (I've just tested 100,000 x 100, which used 1.2 GB).
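
To make the points above concrete, here is a minimal sketch in base R. It turns the rule of thumb into arithmetic, shows a ten-level factor expanding into dummy columns, and calls glm.fit() on a prebuilt model matrix. The helper rough_glm_bytes() and the simulated data are illustrative assumptions, not something from the original exchange.

```r
# Rule of thumb from the text: about 200 bytes per data element for glm().
# rough_glm_bytes() is a hypothetical helper, not an R function.
rough_glm_bytes <- function(nrows, ncols, bytes_per_cell = 200) {
  bytes_per_cell * nrows * ncols
}
rough_glm_bytes(1e5, 100) / 1e9   # estimate in (decimal) GB

# Simulated data: one numeric feature and one factor with 10 levels.
set.seed(1)
n <- 1000
d <- data.frame(
  y  = rbinom(n, 1, 0.5),
  x1 = rnorm(n),
  f  = factor(rep(letters[1:10], length.out = n))
)

# The factor expands to 9 dummy columns: intercept + x1 + 9 = 11 columns.
X <- model.matrix(~ x1 + f, data = d)
dim(X)

# glm.fit() takes the model matrix directly and skips the formula and
# model-frame machinery that glm() wraps around it.
fit_low  <- glm.fit(X, d$y, family = binomial())
fit_high <- glm(y ~ x1 + f, data = d, family = binomial)
all.equal(coef(fit_high), fit_low$coefficients)
```

Both calls run the same fitting code, so the coefficients should agree; the saving from glm.fit() is in the extra copies of the data that glm() keeps in the returned object.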
> Also, is there a library to read data in sparse data format (like SVMlight
> format)?
You mean *store* data in a sparse format when read in? I'm not sure of
the relevance, but look at the function method for bigglm() (where the
'data' argument is a function that supplies the data in chunks) for a way
to avoid even doing that. If the data are numeric there are at least three
sparse-matrix packages on CRAN.
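
As an illustration of the function method, the sketch below builds a chunked CSV reader of the kind bigglm() accepts as its 'data' argument: a function of one argument, reset, that returns the next chunk as a data frame, or NULL when the data are exhausted. The helper make_chunk_reader(), the file layout and the simulated data are assumptions for illustration only; the bigglm() call is skipped if biglm is not installed.

```r
# Hypothetical helper: returns a function(reset) in the form bigglm()
# expects when its 'data' argument is a function.
make_chunk_reader <- function(path, chunk_rows = 10000) {
  col_names <- names(read.csv(path, nrows = 1))
  con <- NULL
  function(reset = FALSE) {
    if (reset) {
      if (!is.null(con)) close(con)
      con <<- file(path, open = "r")
      readLines(con, n = 1)          # skip the header line
      return(invisible(NULL))
    }
    if (is.null(con)) return(NULL)   # nothing to read until reset = TRUE
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) {        # end of file: release the connection
      close(con)
      con <<- NULL
      return(NULL)
    }
    read.csv(text = lines, header = FALSE, col.names = col_names)
  }
}

# Small simulated file with numeric columns only. Factors would need the
# same levels in every chunk, which this sketch does not handle.
set.seed(2)
n <- 5000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.3 + 0.5 * d$x1 - 0.2 * d$x2))
path <- tempfile(fileext = ".csv")
write.csv(d, path, row.names = FALSE)

reader <- make_chunk_reader(path, chunk_rows = 1000)

if (requireNamespace("biglm", quietly = TRUE)) {
  fit <- biglm::bigglm(y ~ x1 + x2, data = reader, family = binomial())
  print(coef(fit))
}
```

Only one chunk is held in memory at a time, at the cost of re-reading the file on every iteration of the fit.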
Ultimately R's code such as glm() is designed for flexibility and to do
interesting things with the fit: for really large problems you will do
better to write a specialized fitting routine. bigglm() is an
intermediate position.
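
To give an idea of what a specialized routine might look like, here is a minimal sketch of Newton/IRLS for logistic regression that only accumulates p x p cross-products over blocks of rows. The function irls_logistic() and its arguments are hypothetical. The matrix is held in memory here for simplicity; in a real large problem each block would be read from disk instead.

```r
# Hypothetical specialized fitter: logistic regression by Newton/IRLS,
# processing the rows in blocks and keeping only p x p summaries.
irls_logistic <- function(X, y, chunk_rows = 1000, maxit = 25, tol = 1e-8) {
  p <- ncol(X)
  beta <- rep(0, p)
  starts <- seq(1, nrow(X), by = chunk_rows)
  for (iter in seq_len(maxit)) {
    XtWX  <- matrix(0, p, p)   # accumulated X' W X
    score <- rep(0, p)         # accumulated X' (y - mu)
    for (s in starts) {
      idx <- s:min(s + chunk_rows - 1, nrow(X))
      Xc  <- X[idx, , drop = FALSE]
      mu  <- plogis(drop(Xc %*% beta))
      w   <- mu * (1 - mu)
      XtWX  <- XtWX + crossprod(Xc, Xc * w)   # Xc * w scales row i by w[i]
      score <- score + drop(crossprod(Xc, y[idx] - mu))
    }
    step <- solve(XtWX, score)
    beta <- beta + step
    if (max(abs(step)) < tol) break
  }
  names(beta) <- colnames(X)
  beta
}

# Check against glm.fit() on simulated data.
set.seed(3)
n <- 4000
X <- cbind(Intercept = 1, x1 = rnorm(n), x2 = rnorm(n))
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.4, 0.7, -0.3))))
beta_hat <- irls_logistic(X, y, chunk_rows = 500)
ref <- glm.fit(X, y, family = binomial())$coefficients
all.equal(beta_hat, ref, tolerance = 1e-6)
```

The sketch returns coefficients only and does no checking for separation or rank deficiency, which is the flexibility that is given up in exchange for the smaller footprint.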
There's also the question of whether there are any interesting homogeneous
datasets of this sort of size. Often doing analyses on subsets and a
meta-analysis is a much more insightful approach (as it was in our
problem: we split on one of the categorical explanatory variables).
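
The split-and-combine idea can be sketched as follows, assuming a fixed-effect, inverse-variance pooling of one coefficient across subsets. That choice of meta-analysis is an assumption made for illustration (the method actually used is not stated above), and the data are simulated.

```r
# Simulated data with a categorical variable g to split on.
set.seed(42)
n <- 3000
d <- data.frame(
  g = factor(rep(c("a", "b", "c"), length.out = n)),
  x = rnorm(n)
)
d$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * d$x))

# Fit the same logistic model separately within each level of g and keep
# the estimate and standard error of the coefficient of x.
per_group <- lapply(split(d, d$g), function(sub) {
  fit <- glm(y ~ x, data = sub, family = binomial)
  s <- summary(fit)$coefficients
  c(est = s["x", "Estimate"], se = s["x", "Std. Error"])
})
per_group <- do.call(rbind, per_group)
per_group

# Fixed-effect pooling: weight each estimate by 1 / se^2.
w <- 1 / per_group[, "se"]^2
pooled_est <- sum(w * per_group[, "est"]) / sum(w)
pooled_se  <- sqrt(1 / sum(w))
c(estimate = pooled_est, se = pooled_se)
```

Each subset fit needs only a fraction of the memory, and the per-subset estimates show whether the effect is in fact homogeneous before anything is pooled.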
> Thanks
> Pradheep
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595