[R] memory once again

Duncan Murdoch murdoch at stats.uwo.ca
Fri Mar 3 21:31:14 CET 2006


On 3/3/2006 2:42 PM, Berton Gunter wrote:
> What you propose is not really a solution, as even if your data set didn't
> break the modified precision, another would. And of course, there is a price
> to be paid for reduced numerical precision.
> 
> The real issue is that R's current design is incapable of dealing with data
> sets larger than what can fit in physical memory (expert
> comment/correction?). 

It can deal with big data sets, just not nearly as conveniently as it 
deals with ones that fit in memory.  The most straightforward way is 
probably to put them in a database, and use RODBC or one of the 
database-specific packages to read the data in blocks.  (You could also 
leave the data in a flat file and read it a block at a time from there, 
but the database is probably worth the trouble:  other people have done 
the work involved in sorting, selecting, etc.)
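
For example, something along these lines (an untested sketch; the DSN 
"mydb", the table name "survey" and the block size are just 
placeholders for whatever you set up):

library(RODBC)
channel <- odbcConnect("mydb")

## fetch the first block of rows, then keep asking for more until
## sqlFetchMore() signals that the table is exhausted
block <- sqlFetch(channel, "survey", max = 10000)
while (is.data.frame(block) && nrow(block) > 0) {
    ## ... do the per-block work here ...
    block <- sqlFetchMore(channel, max = 10000)
}
odbcClose(channel)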

The main problem you'll run into is that almost none of the R functions 
know about databases, so you'll end up doing a lot of work to rewrite 
the algorithms to work one block at a time, or on a random sample of 
data, or whatever.
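
As a toy example of that sort of rewrite (again untested, with a 
made-up column name "income"): a mean can be accumulated as a running 
sum and count, so only one block is ever held in memory:

library(RODBC)
channel <- odbcConnect("mydb")          # placeholder DSN, as above
running_sum <- 0
running_n <- 0
block <- sqlFetch(channel, "survey", max = 10000)
while (is.data.frame(block) && nrow(block) > 0) {
    running_sum <- running_sum + sum(block$income, na.rm = TRUE)
    running_n <- running_n + sum(!is.na(block$income))
    block <- sqlFetchMore(channel, max = 10000)
}
odbcClose(channel)
running_sum / running_n                 # the block-wise mean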

The original poster didn't say what he wanted to do with his data, but 
if he only needs to work with a few variables at a time, he can easily 
fit an 820,000 x N dataframe in memory, for small values of N.  Reading 
it in from a database would be easy.
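
For instance (made-up column names again), pulling just two of the 80 
variables with a SELECT:

library(RODBC)
channel <- odbcConnect("mydb")
d <- sqlQuery(channel, "SELECT income, age FROM survey")
odbcClose(channel)

A column of 820,000 doubles is 820000 * 8 bytes, i.e. about 6.5 MB, so 
a handful of columns is nothing in 1 GB of RAM; object.size(d) will 
confirm it.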

Duncan Murdoch

> My understanding is that there is no way to change
> this without a fundamental redesign of R. This means that you must either
> live with R's limitations or use other software for "large" data sets.
> 
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>  
> "The business of the statistician is to catalyze the scientific learning
> process."  - George E. P. Box
>  
>  
> 
>> -----Original Message-----
>> From: r-help-bounces at stat.math.ethz.ch 
>> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dimitri Joe
>> Sent: Friday, March 03, 2006 11:28 AM
>> To: R-Help
>> Subject: [R] memory once again
>> 
>> Dear all,
>> 
>> A few weeks ago, I asked this list why small Stata files became huge 
>> R files. Thomas Lumley said it was because "Stata uses 
>> single-precision floating point by default and can use 1-byte and 
>> 2-byte integers. R uses double precision floating point and 
>> four-byte integers." And it seemed I couldn't do anything about it.
>> 
>> Is it true? I mean, isn't there a (more or less simple) way to change 
>> how R stores data (maybe by changing the source code and 
>> compiling it)?
>> 
>> The reason why I insist on this point is that I am trying to work 
>> with a data frame with more than 820,000 observations and 80 
>> variables. The Stata file is 150 MB. With my Pentium IV 2 GHz and 
>> 1 GB RAM, Windows XP, I couldn't do the import using the read.dta() 
>> function from package foreign. With Stat Transfer I managed to 
>> convert the Stata file to an S file of 350 MB, but my machine still 
>> didn't manage to import it using read.S().
>> 
>> I even tried to "increase" my memory with memory.limit(4000), but 
>> it still didn't work.
>> 
>> Regardless of the answer to my question, I'd appreciate hearing 
>> about your experience/suggestions for working with big files in R.
>> 
>> Thank you for youR-Help,
>> 
>> Dimitri Szerman
>> 
> 


