[R] Best practices for loading large datasets into R

Greg Snow 538280 at gmail.com
Thu Jan 9 18:53:36 CET 2014


You have made a good start by keeping your data in a database
(reading it in from a text file each time would be even slower).

The first suggestion is to not read in all the data; just bring in
what you need.  For the early steps (exploring the data, getting a
feel for what you want to do, basic plots, etc.) you may want to work
with just a sample of your data, which will be quick and easy to
handle.  Later you can have a script load the full data and analyze it
based on what you learned from the sample.
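
As a rough sketch of pulling just a sample (the table name "obs" and
the columns x, y, z are made up for illustration, and the in-memory
database is only a stand-in for your real SQLite file), something like
this selects two columns and every 100th row:

```r
library(DBI)
library(RSQLite)

# Stand-in database so the sketch runs on its own; in practice you would
# connect to your existing file instead of ":memory:".
con <- dbConnect(RSQLite::SQLite(), ":memory:")
set.seed(1)
obs <- data.frame(id = 1:10000, x = rnorm(10000),
                  y = rnorm(10000), z = runif(10000))
dbWriteTable(con, "obs", obs)  # hypothetical table name

# Bring in only the columns needed, and only every 100th row.
samp <- dbGetQuery(con, "SELECT x, y FROM obs WHERE rowid % 100 = 0")
dim(samp)

dbDisconnect(con)
```

Taking every 100th row is a systematic sample, not a random one, so
it can mislead if the rows are stored in a meaningful order.  A filter
based on SQLite's random() function is an alternative in that case.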

You can also have the database calculate some of the summary
statistics (often quicker) instead of bringing the data into R.
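
For example, group counts and means can be computed inside SQLite so
that only the small result table comes back to R.  This is a sketch
under assumptions: the table "obs" and the columns g and x are
hypothetical, and the in-memory database stands in for the real one.

```r
library(DBI)
library(RSQLite)

# Stand-in database with a grouping column g and a numeric column x.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
obs <- data.frame(g = rep(c("a", "b"), 500), x = as.numeric(1:1000))
dbWriteTable(con, "obs", obs)  # hypothetical table name

# The database does the aggregation; R only receives one row per group.
stats <- dbGetQuery(con, "
  SELECT g, COUNT(*) AS n, AVG(x) AS mean_x
  FROM obs
  GROUP BY g
  ORDER BY g")
stats

dbDisconnect(con)
```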

The ff package has tools for storing large datasets on disk with
just pointers in memory.  It then loads only the pieces that you
need, so just part of the data is in memory at any given time.
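
A minimal sketch of the idea, assuming the ff package is installed: the
vector below lives in a file on disk, and the loop pulls one block of
it into memory at a time.  The block size of 1e5 is an arbitrary choice
for illustration, and a vector this small would fit in memory anyway.

```r
library(ff)

# A file-backed double vector; only the pieces we index are read into RAM.
x <- ff(as.double(seq_len(1e6)))

n <- length(x)
step <- 1e5      # arbitrary block size for this sketch
total <- 0
for (start in seq(1, n, by = step)) {
  idx <- start:min(start + step - 1, n)
  total <- total + sum(x[idx])   # x[idx] is an ordinary in-memory vector
}
total
```

The package also has its own helpers for chunked processing, which
may be more convenient than hand-written index ranges.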

The biglm package also has tools for fitting models while working
with just part of the data at a time.
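
Here is a sketch of how that can look, with simulated data split into
three pieces standing in for chunks fetched from the database (the
variable names y, x1 and x2 are invented for the example).  The model
is fit on the first chunk and then updated with each later one:

```r
library(biglm)

# Simulated data; in practice each chunk would be fetched from the database.
set.seed(42)
n <- 3000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)
chunks <- split(dat, rep(1:3, each = n / 3))

# Fit on the first chunk, then fold in the remaining chunks one at a time.
fit <- biglm(y ~ x1 + x2, data = chunks[[1]])
for (ch in chunks[-1]) {
  fit <- update(fit, ch)
}
coef(fit)
```

The coefficients should match what lm() gives on the full data, but
only one chunk needs to be in memory at any point.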

Some of the tools for parallel processing can work well with large
datasets.  The High Performance Computing Task View would be good for
you to skim through to see if any of those tools look useful to you.
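
As one small illustration, using the parallel package that ships with
R (the Task View lists many other options), the data can be split into
pieces, each piece summarized on a separate worker, and the partial
results combined.  The two workers and four pieces are arbitrary
choices for this sketch.

```r
library(parallel)

x <- as.numeric(1:1e6)

# Split the data into 4 roughly equal pieces.
pieces <- split(x, cut(seq_along(x), 4, labels = FALSE))

cl <- makeCluster(2)                    # two worker processes
partial <- parLapply(cl, pieces, sum)   # each piece is summed on a worker
stopCluster(cl)

# Combine the per-piece results.
total <- Reduce(`+`, partial)
total
```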

On Wed, Jan 8, 2014 at 2:23 PM, James Mahon <james.mahon.3 at gmail.com> wrote:
> Hello,
>
> I'm working with a 22 GB dataset with ~100 million observations and ~40
> variables. It's stored in SQLite and I use the RSQLite package to load it
> into memory. Loading the full population, even for only a few variables,
> can be very slow and I was wondering if there are best practices for how to
> manage large datasets when doing analysis in R. Is there an alternative
> file format / relational database in which I should be storing the data?
>
> Best,
>
> James
> --
> James F. Mahon III, Ph.D. Candidate
> Harvard University
> Tel: (857) 209-8438
> Fax: (270) 813-3498
> Web: http://www.people.fas.harvard.edu/~jmahon/



-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com



