[R] large dataset
tlumley at u.washington.edu
Mon Mar 29 22:55:36 CEST 2010
On Mon, 29 Mar 2010, Gabor Grothendieck wrote:
> On Mon, Mar 29, 2010 at 4:12 PM, Thomas Lumley <tlumley at u.washington.edu> wrote:
>> On Sun, 28 Mar 2010, kMan wrote:
>>>> This was *very* useful for me when I dealt with a 1.5Gb text file
>>> Two hours is a *very* long time to transfer a csv file to a db. The author
>>> of the linked article has not documented how to use scan() arguments
>>> appropriately for the task. I take particular issue with the author's
>>> statement that "R is said to be slow, memory hungry and only capable of
>>> handling small datasets," indicating he/she has crummy informants and not
>>> challenged the notion him/herself.
>> I believe that *I* am the author of the particular statement you take issue
>> with (although not of the rest of the page).
>> However, when I wrote it, it continued:
>> "R (and S) are accused of being slow, memory-hungry, and able to handle only
>> small data sets.
>> This is completely true.
>> Fortunately, computers are fast and have lots of memory. Data sets with a
>> few tens of thousands of observations can be handled in 256Mb of memory, and
>> quite large data sets with 1Gb of memory. Workstations with 32Gb or more to
>> handle millions of observations are still expensive (but in a few years
>> Moore's Law should catch up).
>> Tools for interfacing R with databases allow very large data sets, but this
>> isn't transparent to the user."
> I don't think the last sentence is true if you use sqldf. Assuming
> the standard type of csv file accepted by sqldf:
> install.packages("sqldf")
> library(sqldf)
> DF <- read.csv.sql("myfile.csv")
> is all you need. The install.packages statement downloads and
> installs sqldf, DBI and RSQLite (which in turn installs SQLite
> itself), and then read.csv.sql sets up the database and table layouts,
> reads the file into the database, reads the data from the database
> into R (bypassing R's read routines) and then destroys the database
> all transparently.
It's not the data reading that's the problem. As you say, sqldf handles that nicely. It's using a data set larger than memory that is not transparent -- you need special packages and can still only do a quite limited set of operations.
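To make the distinction concrete, here is a minimal sketch, in base R only, of the kind of operation that does survive a data set larger than memory: a summary that can be accumulated chunk by chunk. The function name chunked_mean, its arguments and the chunk size are hypothetical, and this is not what the special packages do internally. It assumes a plain comma-separated file with a header row and no quoted fields containing commas or line breaks.

```r
# Hypothetical helper: mean of one numeric column of a CSV file,
# computed without holding the whole file in memory at once.
# Assumes a simple CSV: header row, comma separator, no embedded
# commas or newlines inside quoted fields.
chunked_mean <- function(path, column, chunk_rows = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Read the header once so every later chunk gets the same column names.
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  header <- gsub('"', "", header, fixed = TRUE)

  total <- 0
  n <- 0
  repeat {
    # The open connection remembers its position between calls.
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    x <- chunk[[column]]
    total <- total + sum(x, na.rm = TRUE)
    n <- n + sum(!is.na(x))
  }
  total / n
}

# Usage on a small temporary file, read three rows at a time.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, g = letters[1:10]), tmp, row.names = FALSE)
chunked_mean(tmp, "x", chunk_rows = 3L)
```

A mean works because it reduces to a running sum and a running count. Anything that needs all the rows at once, such as a median or a model fit with a general-purpose fitting function, does not decompose this way, which is the sense in which the set of operations is limited.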
Thomas Lumley                   Assoc. Professor, Biostatistics
tlumley at u.washington.edu      University of Washington, Seattle