[R] large dataset

Mon Mar 29 22:55:36 CEST 2010

On Mon, 29 Mar 2010, Gabor Grothendieck wrote:

> On Mon, Mar 29, 2010 at 4:12 PM, Thomas Lumley <tlumley at u.washington.edu> wrote:
>> On Sun, 28 Mar 2010, kMan wrote:
>>
>>>> This was *very* useful for me when I dealt with a 1.5Gb text file
>>>>
>>>> http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/
>>>
>>> Two hours is a *very* long time to transfer a csv file to a db. The author
>>> of the linked article has not documented how to use scan() arguments
>>> appropriately for the task. I take particular issue with the author's
>>> statement that "R is said to be slow, memory hungry and only capable of
>>> handling small datasets," indicating he/she has crummy informants and not
>>> challenged the notion him/herself.
>>
>>
>> Ahem.
>>
>> I believe that *I* am the author of the particular statement you take issue
>> with (although not of the rest of the page).
>>
>> However, when I wrote it, it continued:
>> ---------
>> "R (and S) are accused of being slow, memory-hungry, and able to handle only
>> small data sets.
>>
>> This is completely true.
>>
>> Fortunately, computers are fast and have lots of memory. Data sets with  a
>> few tens of thousands of observations can be handled in 256Mb of memory, and
>> quite large data sets with 1Gb of memory.  Workstations with 32Gb or more to
>> handle millions of observations are still expensive (but in a few years
>> Moore's Law should catch up).
>>
>> Tools for interfacing R with databases allow very large data sets, but this
>> isn't transparent to the user."
>
> I don't think the last sentence is true if you use sqldf.   Assuming
> the standard type of csv file accepted by sqldf:
>
> install.packages("sqldf")
> library(sqldf)
> DF <- read.csv.sql("myfile.csv")
>
> is all you need.  The install.packages statement downloads and
> installs sqldf, DBI and RSQLite (which in turn installs SQLite
> itself), and then read.csv.sql sets up the database and table layouts,
> reads the file into the database, reads the data from the database
> into R (bypassing R's read routines) and then destroys the database
> all transparently.

It's not the data reading that's the problem. As you say, sqldf handles that nicely.  It's using a data set larger than memory that is not transparent -- you need special packages and can still only do a quite limited set of operations.
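
To illustrate what "not transparent" can mean in practice, here is a minimal base-R sketch of computing a single column mean from a csv file without holding the whole file in memory. The helper name chunked_mean, its arguments, and the example file are hypothetical and not from this thread; the sketch assumes a plain comma-separated file with one header line and no line breaks inside quoted fields.

```r
# Hypothetical helper (not from the thread): mean of one numeric column,
# reading at most `chunk_rows` data rows into memory at a time.
chunked_mean <- function(path, column, chunk_rows = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header once; scan() strips any quotes around the names.
  header <- scan(text = readLines(con, n = 1L), what = character(),
                 sep = ",", quiet = TRUE)

  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break          # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    values <- chunk[[column]]
    total <- total + sum(values, na.rm = TRUE)
    n <- n + sum(!is.na(values))            # count only non-missing values
  }
  total / n
}

# Usage on a small temporary file, read three rows at a time.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = as.numeric(1:10)), tmp, row.names = FALSE)
chunked_mean(tmp, "x", chunk_rows = 3L)
```

A running sum is easy to carry across chunks, but the same bookkeeping has to be written by hand for each statistic, and many operations (a median, a model fit) do not decompose this simply, which is why special packages are needed for data larger than memory.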

      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle