[R] large dataset

kMan kchamberln at gmail.com
Tue Mar 30 06:42:21 CEST 2010


Dear Thomas,

While it may be true that "R (and S) are *accused* of being slow,
memory-hungry, and able to handle only small data sets" (emphasis added),
the accusation is false, which renders the *accusers* misinformed.
Transparency is another, perhaps more interesting matter. R users can
*experience* R as limited in the ways described above (a functional
limitation) while still making a false technical assertion; there is no
dichotomy between the two. It is a bit like a cell-phone example from
human-computer interaction circles in the 90s: the phone could technically
work, provided one is an engineer and can make sense of its interface, while
for most people it may *functionally* be nothing more than a paperweight. R
is not "technically" limited in the way the accusation reads (the point I
was making), though many users are functionally limited in that way (the
point you seem to have made, or at least passed along).

An R user can get far more data into memory as single objects than users of
other stats packages can, including MATLAB, JMP, and, obviously, Excel. This
follows from a simple comparison of the programs' documented environment-size
and object limits. The difference in the same read/scan operation between R
and JMP on 600 MB of data could easily be 25+ minutes (R perhaps taking 5-7
minutes and JMP 30+ minutes, on the 1.8 GHz, 3 GB RAM machine I used when I
made the comparison that sold me on R). R can do formal operations with all
of that data in memory, assuming the environment is given enough space to
work with, while JMP will do the same operation in several smaller chunks,
reference the disk several times, and, on Windows machines, cause the OS to
page. In that case, the difference can be upwards of a day. With the ability
to handle larger chunks at once, and direct control over preventing one's OS
from paging, R users should be able to crank out analyses on very large
datasets faster than with other programs.
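
By way of illustration only, here is a minimal sketch of the sort of call I
have in mind, assuming a hypothetical comma-separated file "big.csv" with a
header row, three numeric columns, and one character column. Pre-declaring
column classes and an upper bound on the row count keeps read.table() from
guessing types and repeatedly growing its buffers:

col.classes <- c("numeric", "numeric", "numeric", "character")
dat <- read.table("big.csv", header = TRUE, sep = ",",
                  colClasses = col.classes, # skip type guessing
                  nrows = 2e6,              # generous upper bound on row count
                  comment.char = "")        # disable comment scanning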

I am perfectly willing to accept that consumers of statistical software may
*experience* R as more limiting, in keeping with the accusations, that the
effect may be larger for newcomers, and larger still for newcomers after
controlling for transparency. I'd expect the effect to reverse at around 3
years of experience, whether or not one controls for transparency.
Large-scale data may present technical problems that many users simply
choose not to use R for, so for those issues the effect may never reverse.
Even when R is more than capable of outperforming other programs, its
usability (or access to suitable documentation/training material) apparently
isn't currently up to the challenge. This is something the R community
should be champing at the bit to address.

I'd think a consortium of sorts showcasing large-scale data support in R
would be a stellar contribution, and perhaps an issue of The R Journal
devoted to the topic, say, of a near worst-case scenario: 10 GB of data
containing different data types (categorical, numeric, & embedded matrices)
in a .csv file, with the header information stored somewhere else. How would
the authors explain to a beginner (say, <1 year of experience with I/O) how
to get the data into a more suitable format, and then how did they analyze
it 300 MB at a time, all in R, in a non-cluster/single-user, 32-bit
environment, while controlling the environment size, handling missing data,
and preventing paging? How was their solution different when moving to 64
bit? When moving to a cluster? One of the demos would certainly have to use
scan() exclusively for I/O, perhaps also demonstrating why the 'bad
practice' label attached to working with raw text files is something more
than mere prescription.
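
As a rough sketch of what the chunked scan() loop might look like (assuming,
hypothetically, a headerless comma-separated file "big.csv" with one
categorical and two numeric fields, and an arbitrary chunk size):

con <- file("big.csv", open = "r")
fields <- list(group = character(0), x = numeric(0), y = numeric(0))
repeat {
  chunk <- scan(con, what = fields, sep = ",", nlines = 1e5, quiet = TRUE)
  if (length(chunk$group) == 0) break  # end of file reached
  ## ...summarize the chunk, or append it to a database, here...
}
close(con)

Reading from an open connection lets successive scan() calls pick up where
the previous one left off, so the whole file never has to sit in memory at
once.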

Sincerely,
KeithC.

-----Original Message-----
From: Thomas Lumley [mailto:tlumley at u.washington.edu] 
Sent: Monday, March 29, 2010 2:56 PM
To: Gabor Grothendieck
Cc: kMan; r-help; n.vialma at libero.it
Subject: Re: [R] large dataset

On Mon, 29 Mar 2010, Gabor Grothendieck wrote:

> On Mon, Mar 29, 2010 at 4:12 PM, Thomas Lumley <tlumley at u.washington.edu>
wrote:
>> On Sun, 28 Mar 2010, kMan wrote:
>>
>>>> This was *very* useful for me when I dealt with a 1.5Gb text file
>>>>
>>>> http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/
>>>
>>> Two hours is a *very* long time to transfer a csv file to a db. The author
>>> of the linked article has not documented how to use scan() arguments
>>> appropriately for the task. I take particular issue with the author's
>>> statement that "R is said to be slow, memory hungry and only capable of
>>> handling small datasets," indicating he/she has crummy informants and has
>>> not challenged the notion him/herself.
>>
>>
>> Ahem.
>>
>> I believe that *I* am the author of the particular statement you take
>> issue with (although not of the rest of the page).
>>
>> However, when I wrote it, it continued:
>> ---------
>> "R (and S) are accused of being slow, memory-hungry, and able to handle
>> only small data sets.
>>
>> This is completely true.
>>
>> Fortunately, computers are fast and have lots of memory. Data sets with
>> a few tens of thousands of observations can be handled in 256Mb of
>> memory, and quite large data sets with 1Gb of memory.  Workstations with
>> 32Gb or more to handle millions of observations are still expensive (but
>> in a few years Moore's Law should catch up).
>>
>> Tools for interfacing R with databases allow very large data sets, but
>> this isn't transparent to the user."
>
> I don't think the last sentence is true if you use sqldf.  Assuming
> the standard type of csv file accepted by sqldf:
>
> install.packages("sqldf")
> library(sqldf)
> DF <- read.csv.sql("myfile.csv")
>
> is all you need.  The install.packages statement downloads and
> installs sqldf, DBI and RSQLite (which in turn installs SQLite
> itself), and then read.csv.sql sets up the database and table layouts,
> reads the file into the database, reads the data from the database
> into R (bypassing R's read routines) and then destroys the database
> all transparently.

It's not the data reading that's the problem. As you say, sqldf handles that
nicely.  It's using a data set larger than memory that is not transparent --
you need special packages and can still only do a quite limited set of
operations.

      -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle


