[R] memory once again

Berton Gunter gunter.berton at gene.com
Sat Mar 4 00:05:36 CET 2006


Thanks, Duncan.

I would say that your clarification defines what I mean by "incapable
of dealing with large data sets." To wit: one must handcraft solutions
that work a chunk at a time, versus having some sort of built-in
virtual memory procedure handle it automatically. But as Andy Liaw
suggested to me off list, maybe I am fantasizing the existence of any
software that could deal with, say, terabytes or petabytes of data
without such handcrafting. My son, the computer scientist, tells me
that the astronomers and physicists he works with routinely produce
such massive data sets, as do imaging folks of all stripes, I would
imagine.
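
For concreteness, the kind of handcrafting I have in mind looks
something like the sketch below, which computes a grand mean from a
flat file one block at a time; the file name, block size, and the
assumption that column 1 is numeric are all invented for illustration:

    ## hypothetical flat file: a header row, then one record per line
    con <- file("bigdata.csv", open = "r")
    invisible(readLines(con, n = 1))       # discard the header line
    total <- 0
    n     <- 0
    repeat {
        ## read the next 50,000 records; NULL once the file is exhausted
        chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 50000),
                          error = function(e) NULL)
        if (is.null(chunk) || nrow(chunk) == 0) break
        total <- total + sum(chunk[[1]])   # accumulate column 1
        n     <- n + nrow(chunk)
    }
    close(con)
    total / n                              # grand mean of column 1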

I wonder if our 20th-century statistical modeling paradigms are
increasingly out of step with such 21st-century massive-data
realities... But that is a much more vexing issue that does not belong
here.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
  

> -----Original Message-----
> From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca] 
> Sent: Friday, March 03, 2006 12:31 PM
> To: Berton Gunter
> Cc: dimitrijoe at ipea.gov.br; 'R-Help'
> Subject: Re: [R] memory once again
> 
> On 3/3/2006 2:42 PM, Berton Gunter wrote:
> > What you propose is not really a solution, as even if your data set
> > didn't break the modified precision, another would. And of course,
> > there is a price to be paid for reduced numerical precision.
> > 
> > The real issue is that R's current design is incapable of dealing
> > with data sets larger than what can fit in physical memory (expert
> > comment/correction?).
> 
> It can deal with big data sets, just not nearly as conveniently as it
> deals with ones that fit in memory.  The most straightforward way is
> probably to put them in a database, and use RODBC or one of the
> database-specific packages to read the data in blocks.  (You could
> also leave the data in a flat file and read it a block at a time from
> there, but the database is probably worth the trouble: other people
> have done the work involved in sorting, selecting, etc.)
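
For concreteness, the block-reading pattern Duncan describes might look
something like the sketch below with RODBC; the DSN name "bigdb", the
table "survey", and the block size are invented for illustration, so
treat it as a rough outline rather than a recipe:

    library(RODBC)
    channel <- odbcConnect("bigdb")              # hypothetical ODBC DSN
    odbcQuery(channel, "SELECT * FROM survey")   # submit the query once
    repeat {
        ## fetch the next block of (at most) 10,000 rows
        block <- sqlGetResults(channel, max = 10000)
        if (!is.data.frame(block) || nrow(block) == 0) break
        ## ... update running sums, fit on this block, etc. ...
    }
    odbcClose(channel)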
> 
> The main problem you'll run into is that almost none of the R
> functions know about databases, so you'll end up doing a lot of work
> to rewrite the algorithms to work one block at a time, or on a random
> sample of data, or whatever.
> 
> The original poster didn't say what he wanted to do with his data,
> but if he only needs to work with a few variables at a time, he can
> easily fit an 820,000 x N data frame in memory, for small values of
> N.  Reading it in from a database would be easy.
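
For instance, pulling just a handful of variables is a one-liner once
the data are in a database (again with invented DSN, table, and column
names); 820,000 rows by three double columns is only about 20 MB:

    library(RODBC)
    channel <- odbcConnect("bigdb")
    ## 820,000 rows x 3 numeric columns x 8 bytes is roughly 20 MB
    dat <- sqlQuery(channel, "SELECT income, age, region FROM survey")
    odbcClose(channel)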
> 
> Duncan Murdoch
> 
> > My understanding is that there is no way to change this without a
> > fundamental redesign of R. This means that you must either live
> > with R's limitations or use other software for "large" data sets.
> > 
> > -- Bert Gunter
> > Genentech Non-Clinical Statistics
> > South San Francisco, CA
> >  
> > "The business of the statistician is to catalyze the 
> scientific learning
> > process."  - George E. P. Box
> >  
> >  
> > 
> >> -----Original Message-----
> >> From: r-help-bounces at stat.math.ethz.ch 
> >> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dimitri Joe
> >> Sent: Friday, March 03, 2006 11:28 AM
> >> To: R-Help
> >> Subject: [R] memory once again
> >> 
> >> Dear all,
> >> 
> >> A few weeks ago, I asked this list why small Stata files became
> >> huge R files. Thomas Lumley said it was because "Stata uses
> >> single-precision floating point by default and can use 1-byte and
> >> 2-byte integers. R uses double precision floating point and
> >> four-byte integers." And it seemed I couldn't do anything about it.
> >> 
> >> Is it true? I mean, isn't there a (more or less simple) way to
> >> change how R stores data (maybe by changing the source code and
> >> compiling it)?
> >> 
> >> The reason I insist on this point is that I am trying to work with
> >> a data frame with more than 820,000 observations and 80 variables.
> >> The Stata file is 150 MB. With my Pentium IV 2 GHz with 1 GB RAM,
> >> running Windows XP, I couldn't do the import using the read.dta()
> >> function from package foreign. With Stat Transfer I managed to
> >> convert the Stata file to an S file of 350 MB, but my machine
> >> still didn't manage to import it using read.S().
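
A quick back-of-the-envelope calculation, assuming all 80 variables
come in as 8-byte doubles, shows why this import is such a tight fit
in 1 GB of RAM:

    ## 820,000 rows x 80 columns x 8 bytes per double, in megabytes
    820000 * 80 * 8 / 2^20
    ## roughly 500 MB -- before read.dta() makes any working copies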
> >> 
> >> I even tried to "increase" my memory by memory.limit(4000), but
> >> it still didn't work.
> >> 
> >> Regardless of the answer to my question, I'd appreciate hearing
> >> about your experience/suggestions in working with big files in R.
> >> 
> >> Thank you for youR-Help,
> >> 
> >> Dimitri Szerman
> >> 
> >> ______________________________________________
> >> R-help at stat.math.ethz.ch mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide! 
> >> http://www.R-project.org/posting-guide.html
> >>
> > 
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> 
> 


