[R] Reasons to Use R (no memory limitations :-))

Mon Apr 16 05:23:34 CEST 2007

This thread has discussed R's memory limitations and compared its handling of large data with S and SAS. Since I routinely use R to process multi-gigabyte data sets on computers that sometimes have only 256 MB of memory, here are some comments on that.

Most memory limitations vanish if R is used with a relational database. [My personal preference is SQLite (RSQLite package) because of its speed and because it needs no administration (it is used in embedded mode).] The comments below apply to any relational database, unless otherwise stated.

Most people appear to think of database tables as dataframes, that is, they store and load the _whole_ dataframe in one go, probably because the relevant function names suggest this approach. It is also a natural mapping. This is convenient if the data set fits fully in memory, but it limits the size of the data set in exactly the same way as not using a database at all.

However, by using SQL directly one can expand the size of the data set R is capable of operating on - we just have to stop treating database tables as 'atomic'. For example, assume we have a table of several million patients and want to analyze some specific subset. The following SQL statement
  SELECT * FROM patients WHERE gender='M' AND age BETWEEN 30 AND 35
will bring a much smaller dataframe into R than selecting the whole table. [Such a subset selection may also take _less_time_ than selecting from the full dataframe in R, assuming the table is properly indexed.]
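
A minimal sketch of the subset selection above, assuming the DBI and RSQLite packages are installed. The in-memory database, the toy `patients` rows, and the index name `idx_gender_age` are made up for illustration; with real data the table would live in a database file on disk.

```r
library(DBI)

# Hypothetical toy data; a real table would hold millions of rows on disk.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
patients <- data.frame(
  id     = 1:6,
  gender = c("M", "F", "M", "M", "F", "M"),
  age    = c(31, 33, 52, 35, 30, 29)
)
dbWriteTable(con, "patients", patients)

# An index can let the database avoid scanning the whole table for this filter.
invisible(dbExecute(con, "CREATE INDEX idx_gender_age ON patients (gender, age)"))

# Only the matching rows are transferred into R.
subset_df <- dbGetQuery(
  con,
  "SELECT * FROM patients WHERE gender = 'M' AND age BETWEEN 30 AND 35"
)
print(subset_df)

dbDisconnect(con)
```

The point is that `subset_df` is an ordinary dataframe holding only the selected rows; the rest of the table never enters R's memory.
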
Direct SQL statements can also be used to pre-compute some characteristics inside the database and bring only the summaries into R:
  SELECT gender, AVG(age) FROM patients GROUP BY gender
will return a dataframe with one row per gender, i.e. only two rows here.
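
As a rough illustration of the summary query above, again assuming DBI and RSQLite are available, and with a made-up four-row `patients` table and made-up column aliases `mean_age` and `n`:

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "patients", data.frame(
  gender = c("M", "F", "M", "F"),
  age    = c(30, 40, 50, 20)
))

# The averaging happens inside the database; R receives one row per gender.
summary_df <- dbGetQuery(
  con,
  "SELECT gender, AVG(age) AS mean_age, COUNT(*) AS n
     FROM patients
    GROUP BY gender"
)
print(summary_df)

dbDisconnect(con)
```

However large the table is, the result that reaches R has only as many rows as there are distinct values of `gender`.
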

Admittedly, if the data set is really large and we cannot operate on its subsets, the above does not help. However, I do not believe that this is the case in the majority of situations.

Naturally, moving to a 64-bit system with enough memory will solve some of these problems without a database, but not all of them. Relational databases can be very efficient at selecting subsets because, when the tables are indexed, they do not have to do linear scans, while R has to do a linear scan every time (I have not looked at the R source code, so please correct me if I am wrong). There are two other areas where a database is better than R, especially for large data sets:
 - verification of data correctness for individual points [a frequent problem with large data sets]
 - combining data from several different types of tables into one dataframe
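
Both points can be sketched with plain SQL issued from R. This is only an illustration under assumptions: DBI and RSQLite are installed, the `visits` table and its columns are hypothetical, and the validity rules (age between 0 and 120, gender 'M' or 'F') are invented for the example.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "patients", data.frame(
  id     = 1:4,
  gender = c("M", "F", "X", "M"),
  age    = c(34, 41, 29, -5)
))
dbWriteTable(con, "visits", data.frame(
  patient_id = c(1L, 1L, 2L, 4L),
  cost       = c(100, 50, 75, 20)
))

# Data check: fetch only the suspicious rows, not the whole table.
bad_rows <- dbGetQuery(
  con,
  "SELECT * FROM patients
    WHERE age NOT BETWEEN 0 AND 120 OR gender NOT IN ('M', 'F')"
)
print(bad_rows)

# Combine two tables inside the database and bring back one dataframe.
per_patient <- dbGetQuery(
  con,
  "SELECT p.id, p.gender, COUNT(v.cost) AS n_visits, SUM(v.cost) AS total_cost
     FROM patients p
     JOIN visits v ON v.patient_id = p.id
    GROUP BY p.id, p.gender
    ORDER BY p.id"
)
print(per_patient)

dbDisconnect(con)
```

In both queries the work is done by the database, and R only receives the (usually small) result.
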

In summary: using SQL from R allows one to process extremely large data sets in limited memory, sometimes even faster than if we had a large memory and kept the data set fully in it. A relational database perfectly complements R's capabilities.