[R] memory once again

Duncan Murdoch murdoch at stats.uwo.ca
Sat Mar 4 00:29:11 CET 2006


Berton Gunter wrote:
> Thanks, Duncan.
> 
> I would say that your clarification defines what I mean by "incapable of
> dealing with large data sets."  To wit: one must handcraft solutions
> working a chunk at a time versus having some sort of built-in virtual memory
> procedure handle it automatically.

The general point of view in R is that that's the job of the operating 
system.  R is currently artificially limited to 4-billion-element 
vectors, but the elements could each be large.  Presumably some future 
release of R will switch to 64-bit indexing, and then R will be able to 
handle petabytes of data if the operating system can provide the virtual 
memory.

If you want to handle really big datasets transparently now, I think 
S-PLUS has something along the lines you are talking about, but I 
haven't tried it.

Duncan Murdoch

> But as Andy Liaw suggested to me off
> list, maybe I fantasize the existence of any software that could deal with,
> say, terabytes or petabytes of data without such handcrafting. My son, the
> computer scientist, tells me that the astronomers and physicists he works
> with routinely produce such massive data sets, as do imaging folks of all
> stripes I would imagine.
> 
> I wonder if our 20th century statistical modeling paradigms are increasingly
> out of step with such 21st century massive data realities... But that is a
> much more vexing issue that does not belong here. 
> 
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>   
> 
> 
>>-----Original Message-----
>>From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
>>Sent: Friday, March 03, 2006 12:31 PM
>>To: Berton Gunter
>>Cc: dimitrijoe at ipea.gov.br; 'R-Help'
>>Subject: Re: [R] memory once again
>>
>>On 3/3/2006 2:42 PM, Berton Gunter wrote:
>>
>>>What you propose is not really a solution, as even if your data set
>>>didn't break the modified precision, another would. And of course,
>>>there is a price to be paid for reduced numerical precision.
>>>
>>>The real issue is that R's current design is incapable of dealing with
>>>data sets larger than what can fit in physical memory (expert
>>>comment/correction?). 
>>
>>It can deal with big data sets, just not nearly as conveniently as it
>>deals with ones that fit in memory.  The most straightforward way is
>>probably to put them in a database, and use RODBC or one of the
>>database-specific packages to read the data in blocks.  (You could also
>>leave the data in a flat file and read it a block at a time from there,
>>but the database is probably worth the trouble:  other people have done
>>the work involved in sorting, selecting, etc.)
>>
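
A minimal sketch of that block-by-block approach with RODBC (the DSN name
"bigdata", the table name "survey" and the block size of 10,000 rows are
placeholders, not anything taken from this thread):

library(RODBC)

ch <- odbcConnect("bigdata")          # DSN configured outside R

## Fetch the first block, then keep asking for more rows from the same
## pending result set until it is exhausted.
block <- sqlFetch(ch, "survey", max = 10000)
while (is.data.frame(block) && nrow(block) > 0) {
    ## ... process 'block' here (summaries, model updates, etc.) ...
    block <- sqlFetchMore(ch, max = 10000)
}

odbcClose(ch)
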
>>The main problem you'll run into is that almost none of the R functions
>>know about databases, so you'll end up doing a lot of work to rewrite
>>the algorithms to work one block at a time, or on a random sample of
>>data, or whatever.
>>
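
And a sketch of the flat-file variant, with a simple algorithm (a mean)
rewritten to work one block at a time.  The file name, tab delimiter,
header line, 80-column layout with the third column numeric, and block
size are all assumptions for illustration:

## Keep only the third of 80 columns; "NULL" classes are dropped on input.
classes <- c("NULL", "NULL", "numeric", rep("NULL", 77))

con <- file("bigfile.txt", open = "r")
invisible(readLines(con, n = 1))            # skip the header line

total <- 0; n <- 0
repeat {
    lines <- readLines(con, n = 10000)      # next block of raw lines
    if (length(lines) == 0) break           # end of file
    tc    <- textConnection(lines)
    block <- read.table(tc, sep = "\t", colClasses = classes)
    close(tc)
    total <- total + sum(block[[1]])
    n     <- n + nrow(block)
}
close(con)

total / n                                   # mean computed block by block
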
>>The original poster didn't say what he wanted to do with his data, but
>>if he only needs to work with a few variables at a time, he can easily
>>fit an 820,000 x N dataframe in memory, for small values of N.  Reading
>>it in from a database would be easy.
>>
>>Duncan Murdoch
>>
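
For scale (and why "small values of N" matter): doubles take 8 bytes
each, so 820,000 rows cost about 6 Mb per numeric column.  A rough
estimate, plus a placeholder query pulling only the needed variables
(the DSN, table and column names below are invented):

## Back-of-the-envelope sizes for an all-double data frame.
820000 * 80 * 8 / 2^20    # all 80 variables: about 500 Mb
820000 *  5 * 8 / 2^20    # five variables:   about  31 Mb

## Reading only the needed columns from a database via RODBC.
library(RODBC)
ch  <- odbcConnect("bigdata")                       # placeholder DSN
dat <- sqlQuery(ch, "SELECT id, income, age FROM survey")
odbcClose(ch)
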
>>>My understanding is that there is no way to change this without a
>>>fundamental redesign of R. This means that you must either live with
>>>R's limitations or use other software for "large" data sets.
>>>
>>>-- Bert Gunter
>>>Genentech Non-Clinical Statistics
>>>South San Francisco, CA
>>>
>>>"The business of the statistician is to catalyze the scientific
>>>learning process."  - George E. P. Box
>>> 
>>> 
>>>
>>>
>>>>-----Original Message-----
>>>>From: r-help-bounces at stat.math.ethz.ch 
>>>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dimitri Joe
>>>>Sent: Friday, March 03, 2006 11:28 AM
>>>>To: R-Help
>>>>Subject: [R] memory once again
>>>>
>>>>Dear all,
>>>>
>>>>A few weeks ago, I asked this list why small Stata files became huge
>>>>R files. Thomas Lumley said it was because "Stata uses
>>>>single-precision floating point by default and can use 1-byte and
>>>>2-byte integers. R uses double precision floating point and four-byte
>>>>integers." And it seemed I couldn't do anything about it.
>>>>
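
(Those sizes are easy to confirm from within R: integers take 4 bytes
each and doubles 8, so the same column can cost several times what a
compact Stata file stores.  A quick illustration:)

object.size(integer(1e6))   # roughly 4 million bytes
object.size(double(1e6))    # roughly 8 million bytes
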
>>>>Is it true? I mean, isn't there a (more or less simple) way to change
>>>>how R stores data (maybe by changing the source code and compiling
>>>>it)?
>>>>
>>>>The reason why I insist on this point is because I am trying to work
>>>>with a data frame with more than 820,000 observations and 80
>>>>variables. The Stata file is 150Mb. With my Pentium IV 2GHz and 1G
>>>>RAM, Windows XP, I couldn't do the import using the read.dta()
>>>>function from package foreign. With Stat Transfer I managed to
>>>>convert the Stata file to an S file of 350Mb, but my machine still
>>>>didn't manage to import it using read.S().
>>>>
>>>>I even tried to "increase" my memory by memory.limit(4000), but it
>>>>still didn't work.
>>>>
>>>>Regardless of the answer to my question, I'd appreciate hearing about
>>>>your experience/suggestions in working with big files in R.
>>>>
>>>>Thank you for youR-Help,
>>>>
>>>>Dimitri Szerman
>>>>
>>>>______________________________________________
>>>>R-help at stat.math.ethz.ch mailing list
>>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>>PLEASE do read the posting guide! 
>>>>http://www.R-project.org/posting-guide.html
>>>>
>>>
>>>______________________________________________
>>>R-help at stat.math.ethz.ch mailing list
>>>https://stat.ethz.ch/mailman/listinfo/r-help
>>>PLEASE do read the posting guide!
>>>http://www.R-project.org/posting-guide.html
>>
>>



