[R] Processing large datasets

Marc Schwartz marc_schwartz at me.com
Wed May 25 16:46:57 CEST 2011


Take a look at the High-Performance and Parallel Computing with R CRAN Task View:

  http://cran.us.r-project.org/web/views/HighPerformanceComputing.html

specifically at the section labeled "Large memory and out-of-memory data".

There are some specific R features and packages that have been implemented to enable out-of-memory operations, but not all of R works that way.
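
For example, a minimal sketch using the ff package (one of the packages covered in that section); the file name and column name here are hypothetical:

  library(ff)

  ## Read a large CSV into an on-disk ffdf object rather than into RAM.
  ## "quotes.csv" is a hypothetical dump of the market feed.
  quotes <- read.csv.ffdf(file = "quotes.csv", header = TRUE)

  ## Individual columns can be materialized in RAM as needed, e.g. to
  ## compute a summary statistic ("size" is a hypothetical column):
  mean(quotes$size[])

The bigmemory family of packages listed in the same section follows a similar disk-backed model.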

I believe that Revolution's commercial version of R has developed 'big data' functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase accessible RAM; however, there will still be object-size limitations because R uses 32-bit signed integers for indexing into objects. See ?"Memory-limits" for more information.
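
As a quick illustration of that limit:

  ## The per-object indexing limit (2^31 - 1 elements per vector):
  .Machine$integer.max
  # [1] 2147483647

  ## And how much RAM a given object occupies, here a vector of one
  ## million doubles:
  print(object.size(numeric(1e6)), units = "Mb")
  # 7.6 Mb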

HTH,

Marc Schwartz


On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:

> Thanks Jonathan. 
> 
> I'm already using RMySQL to load a couple of days' worth of data. 
> I wanted to know what the relevant R capabilities are if I want to process much bigger tables. 
> 
> R always reads the whole data set into memory, and this might be a limitation in the case of big tables, correct? 
> Doesn't it use temporary files or something similar to deal with such amounts of data? 
> 
> As an example, I know that SAS handles sas7bdat files up to 1TB on a box with 76GB of memory without noticeable issues. 
> 
> --Roman 
> 
> ----- Original Message -----
> 
>> In cases where I have to parse through large datasets that will not
>> fit into R's memory, I will grab relevant data using SQL and then
>> analyze said data using R. There are several packages designed to do
>> this, like [1] and [2] below, that allow you to query a database
>> using SQL and end up with that data in an R data.frame.
> 
>> [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
>> [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
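
A minimal sketch of that query-into-a-data.frame approach, assuming RMySQL and a hypothetical "quotes" table holding the feed data (connection details and column names are placeholders):

  library(RMySQL)

  ## Placeholder connection details for your own database.
  con <- dbConnect(MySQL(), dbname = "marketdata",
                   user = "user", password = "password")

  ## Push the filtering and aggregation into MySQL and pull back only
  ## the reduced result as an ordinary data.frame.
  df <- dbGetQuery(con, "
      SELECT symbol, FLOOR(ms / 60000) AS minute,
             AVG(price) AS avg_price, SUM(volume) AS total_volume
      FROM quotes
      WHERE side = 'Bid'
      GROUP BY symbol, FLOOR(ms / 60000)")

  dbDisconnect(con)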
> 
>> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
>> <roman at bestroman.com> wrote:
>>> Hi R list,
>>> 
>>> I'm new to the R software, so I'd like to ask about its capabilities.
>>> What I'm looking to do is to run some statistical tests on quite
>>> big tables, which are aggregated quotes from a market feed.
>>> 
>>> This is a typical set of data.
>>> Each day contains millions of records (up to 10 million, unfiltered).
>>> 
>>> 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
>>> 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
>>> 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
>>> 
>>> I'll need to filter it first based on some criteria.
>>> Since I keep it in a MySQL database, this can be done with a query.
>>> Not super efficient; I've checked that already.
>>> 
>>> Then I need to aggregate the dataset into different time frames
>>> (time is represented in ms from midnight, like 35482391).
>>> Again, this can be done with a database query; I'm not sure which
>>> is going to be faster.
>>> The aggregated tables are going to be much smaller, on the order of
>>> thousands of rows per observation day.
>>> 
>>> Then I'll calculate basic statistics: mean, standard deviation, sums, etc.
>>> After the stats are calculated, I need to perform some statistical
>>> hypothesis tests.
>>> 
>>> So, my question is: which tool is faster for data aggregation and
>>> filtering on big datasets: MySQL or R?
>>> 
>>> Thanks,
>>> --Roman N.


