[R] large dataset

Tue Mar 30 19:04:51 CEST 2010

KeithC,

If you're arguing that there should be more documentation and examples explaining how to use very large data sets with R, then I agree. Feel free to write some.

I've been giving tutorials on this for years now.  I wrote the first netCDF interface package for R because I needed to use data that wouldn't fit on a 64Mb system. I wrote the biglm package to handle out-of-core regression. My presentation at the last useR meeting was on how to automatically load variables on demand from a SQL connection.

It's still true that you can't treat large data sets and small data sets the same way, and I still think it's even more important to point out that almost no one has large data or needs to worry about these issues.
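As a rough illustration of treating a large file differently from a small one, here is a minimal sketch in base R. It reads a CSV in fixed-size chunks from an open connection and accumulates the cross-products needed for a least-squares fit, so only one chunk is in memory at a time. The demo file, the column names x and y, and chunk_size are all made up for the example. This is not the code of any package mentioned above; solving the normal equations directly is less numerically careful than what a dedicated package would do, and it is used here only because it is short.

```r
# Write a small demo CSV (a stand-in for a file too large for memory).
set.seed(1)
n <- 1000
demo <- data.frame(x = rnorm(n))
demo$y <- 2 + 3 * demo$x + rnorm(n)
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

# Open the file once and read the header line by hand, so that later
# reads continue from wherever the previous chunk stopped.
chunk_size <- 100   # hypothetical; in practice chosen to fit in memory
con <- file(path, open = "r")
col_names <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

# Accumulate X'X and X'y chunk by chunk.
xtx <- matrix(0, 2, 2)
xty <- matrix(0, 2, 1)
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size, col.names = col_names),
    error = function(e) NULL   # read.csv signals an error once no lines remain
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  X <- cbind(1, chunk$x)       # intercept column plus predictor
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, chunk$y)
}
close(con)

# Solve the normal equations and compare with an in-memory fit.
beta_chunked <- drop(solve(xtx, xty))
beta_full <- unname(coef(lm(y ~ x, data = demo)))
```

On the demo data the two coefficient vectors should agree to within rounding error; the point is only that the chunked version never holds more than chunk_size rows at once.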

    -thomas

On Mon, 29 Mar 2010, kMan wrote:

> Dear Thomas,
>
> While it may be true that "R (and S) are *accused* of being slow,
> memory-hungry, and able to handle only
> small data sets" (emphasis added), the accusation is false, rendering the
> *accusers* misinformed. Transparency is another, perhaps more interesting
> matter. R-users can *experience* R as limited in the ways described above (a
> functional limitation) while making a false technical assertion, without
> generating a dichotomy. It is a bit like a cell phone example from
> human-computer interaction circles in the 90s. The phone could technically
> work, provided one is an engineer so as to make sense out of its interface,
> while for most people, it may *functionally* be nothing more than a
> paperweight. R is not "technically" limited in the way the accusation reads
> (the point I was making), though many users are functionally limited so (the
> point you seem to have made or at least passed along).
>
> An R user can get far more data into memory as single objects with R than
> with other stats packages, including MATLAB, JMP, and, obviously, Excel.
> This is just a simple comparison of the programs' documented environment
> size and object limits. The difference in the same read/scan operation
> between R and JMP on 600 Mb of data could easily be 25+ minutes (R perhaps
> taking 5-7 minutes, with JMP taking 30+ minutes, assuming 1.8GHz & 3GB RAM I
> used back when I made the comparison that sold me on R). R can do formal
> operations with all that data in memory, assuming the environment is given
> enough space to work with, while JMP will do the same operation in several
> smaller chunks, reference the disk several times, AND on Windows machines,
> cause the OS to page. In that case, the differences can be upwards of a day.
> With the ability to handle larger chunks at once, and direct control over
> preventing one's OS from paging, R users should be able to crank out
> analyses on very large datasets faster than other programs.
>
> I am perfectly willing to accept that consumers of statistical software may
> *experience* R as more limiting, in keeping with the accusations, that the
> effect may be larger for newcomers, and even larger for newcomers after
> controlling for transparency. I'd expect the effect to reverse at around 3
> years of experience, controlling for transparency or not. Large scale data
> may present technical problems many users choose simply to avoid using R
> for, so the effect may not reverse for these issues. Even when R is more
> than capable of outperforming other programs, its usability (or access to
> suitable documentation/training material) apparently isn't currently up to
> the challenge. This is something the R community should be champing at the
> bit to address.
>
> I'd think a consortium of sorts showcasing large-scale data support in R
> would be a stellar contribution, and perhaps an issue of R-journal devoted
> to the topic, say, of near worst-case scenario - 10Gb of data containing
> different data types (categorical, numeric, & embedded matrices), in a .csv
> file, header information somewhere else. Now how do the authors explain to
> the beginner (say, <1 year experience with I/O) how to tackle getting the
> data into a more suitable format, and then how did they analyze it 300Mb at
> a time, all using R, in a non-cluster/single user environment, 32 bit, while
> controlling for the environment size, missing data, and preventing paging?
> How was their solution different when moving to 64 bit? Moving to a cluster?
> One of the demos would certainly have to use scan() exclusively for I/O,
> perhaps also demonstrating why the 'bad practice' part of working with raw
> text files is something more than mere prescription.
>
> Sincerely,
> KeithC.
>
> -----Original Message-----
> From: Thomas Lumley [mailto:tlumley at u.washington.edu]
> Sent: Monday, March 29, 2010 2:56 PM
> To: Gabor Grothendieck
> Cc: kMan; r-help; n.vialma at libero.it
> Subject: Re: [R] large dataset
>
> On Mon, 29 Mar 2010, Gabor Grothendieck wrote:
>
>> On Mon, Mar 29, 2010 at 4:12 PM, Thomas Lumley <tlumley at u.washington.edu> wrote:
>>> On Sun, 28 Mar 2010, kMan wrote:
>>>
>>>>> This was *very* useful for me when I dealt with a 1.5Gb text file
>>>>>
>>>>> http://www.csc.fi/sivut/atcsc/arkisto/atcsc3_2007/ohjelmistot_html/R_and_large_data/
>>>>
>>>> Two hours is a *very* long time to transfer a csv file to a db. The author
>>>> of the linked article has not documented how to use scan() arguments
>>>> appropriately for the task. I take particular issue with the author's
>>>> statement that "R is said to be slow, memory hungry and only capable of
>>>> handling small datasets," indicating he/she has crummy informants and has
>>>> not challenged the notion him/herself.
>>>
>>>
>>> Ahem.
>>>
>>> I believe that *I* am the author of the particular statement you take issue
>>> with (although not of the rest of the page).
>>>
>>> However, when I wrote it, it continued:
>>> ---------
>>> "R (and S) are accused of being slow, memory-hungry, and able to handle only
>>> small data sets.
>>>
>>> This is completely true.
>>>
>>> Fortunately, computers are fast and have lots of memory. Data sets with a
>>> few tens of thousands of observations can be handled in 256Mb of memory, and
>>> quite large data sets with 1Gb of memory.  Workstations with 32Gb or more to
>>> handle millions of observations are still expensive (but in a few years
>>> Moore's Law should catch up).
>>>
>>> Tools for interfacing R with databases allow very large data sets, but this
>>> isn't transparent to the user."
>>
>> I don't think the last sentence is true if you use sqldf. Assuming
>> the standard type of csv file accepted by sqldf:
>>
>> install.packages("sqldf")
>> library(sqldf)
>> DF <- read.csv.sql("myfile.csv")
>>
>> is all you need.  The install.packages statement downloads and
>> installs sqldf, DBI and RSQLite (which in turn installs SQLite
>> itself), and then read.csv.sql sets up the database and table layouts,
>> reads the file into the database, reads the data from the database
>> into R (bypassing R's read routines) and then destroys the database
>> all transparently.
>
> It's not the data reading that's the problem. As you say, sqldf handles that
> nicely.  It's using a data set larger than memory that is not transparent --
> you need special packages and can still only do a quite limited set of
> operations.
>
>      -thomas
>
> Thomas Lumley			Assoc. Professor, Biostatistics
> tlumley at u.washington.edu	University of Washington, Seattle
>

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle