[R] memory once again
Duncan Murdoch
murdoch at stats.uwo.ca
Sat Mar 4 00:29:11 CET 2006
Berton Gunter wrote:
> Thanks, Duncan.
>
> I would say that your clarification defines what I mean by "incapable of
> dealing with large data sets." To wit: one must handcraft solutions
> working a chunk at a time versus having some sort of built-in virtual memory
> procedure handle it automatically.
The general point of view in R is that this is the job of the operating
system. R is currently artificially limited to 4-billion-element
vectors, but the elements could each be large. Presumably some future
release of R will switch to 64-bit indexing, and then R will be able to
handle petabytes of data if the operating system can provide the virtual
memory.
If you want to handle really big datasets transparently now, I think
S-PLUS has something along the lines you are talking about, but I
haven't tried it.
Duncan Murdoch
> But as Andy Liaw suggested to me off
> list, maybe I fantasize the existence of any software that could deal with,
> say, terabytes or petabytes of data without such handcrafting. My son, the
> computer scientist, tells me that the astronomers and physicists he works
> with routinely produce such massive data sets, as do imaging folks of all
> stripes, I would imagine.
>
> I wonder if our 20th century statistical modeling paradigms are increasingly
> out of step with such 21st century massive data realities... But that is a
> much more vexing issue that does not belong here.
>
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>
>
>
>>-----Original Message-----
>>From: Duncan Murdoch [mailto:murdoch at stats.uwo.ca]
>>Sent: Friday, March 03, 2006 12:31 PM
>>To: Berton Gunter
>>Cc: dimitrijoe at ipea.gov.br; 'R-Help'
>>Subject: Re: [R] memory once again
>>
>>On 3/3/2006 2:42 PM, Berton Gunter wrote:
>>
>>>What you propose is not really a solution, as even if your data set didn't
>>>break the modified precision, another would. And of course, there is a price
>>>to be paid for reduced numerical precision.
>>>
>>>The real issue is that R's current design is incapable of dealing with
>>>data sets larger than what can fit in physical memory (expert
>>>comment/correction?).
>>
>>It can deal with big data sets, just not nearly as conveniently as it
>>deals with ones that fit in memory. The most straightforward way is
>>probably to put them in a database, and use RODBC or one of the
>>database-specific packages to read the data in blocks. (You could also
>>leave the data in a flat file and read it a block at a time from there,
>>but the database is probably worth the trouble: other people have done
>>the work involved in sorting, selecting, etc.)
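
(For concreteness, a minimal sketch of the block-at-a-time RODBC approach
described above; the DSN "bigdata", the table name "survey", and the block
size are made up for illustration:)

    library(RODBC)
    channel <- odbcConnect("bigdata")                  # hypothetical ODBC data source
    block <- sqlFetch(channel, "survey", max = 10000)  # first 10,000 rows
    while (is.data.frame(block) && nrow(block) > 0) {
        ## ... process this block here ...
        block <- sqlFetchMore(channel, max = 10000)    # fetch the next 10,000 rows
    }
    odbcClose(channel)
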
>>
>>The main problem you'll run into is that almost none of the R functions
>>know about databases, so you'll end up doing a lot of work to rewrite
>>the algorithms to work one block at a time, or on a random sample of
>>data, or whatever.
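
(A toy example of reworking an algorithm to run one block at a time: a column
mean accumulated over chunks of a flat file. The file name, the absence of a
header row, and the numeric first column are all assumptions:)

    con <- file("bigdata.txt", open = "r")    # hypothetical flat file, no header row
    total <- 0; n <- 0
    repeat {
        block <- try(read.table(con, nrows = 10000), silent = TRUE)
        if (inherits(block, "try-error")) break   # no lines left to read
        total <- total + sum(block[[1]])          # accumulate the first column
        n <- n + nrow(block)
    }
    close(con)
    total / n                                     # mean of the full column
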
>>
>>The original poster didn't say what he wanted to do with his data, but
>>if he only needs to work with a few variables at a time, he can easily
>>fit an 820,000 x N data frame in memory, for small values of N. Reading
>>it in from a database would be easy.
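
(A sketch of that last point: pulling just a few columns into an ordinary data
frame with a single SQL query. The column and table names are invented:)

    library(RODBC)
    channel <- odbcConnect("bigdata")    # same hypothetical data source as above
    dat <- sqlQuery(channel, "SELECT id, income, region FROM survey")
    odbcClose(channel)
    dim(dat)    # roughly 820,000 rows by 3 columns, easily held in 1 GB of RAM
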
>>
>>Duncan Murdoch
>>
>>>My understanding is that there is no way to change this without a
>>>fundamental redesign of R. This means that you must either live with
>>>R's limitations or use other software for "large" data sets.
>>
>>>-- Bert Gunter
>>>Genentech Non-Clinical Statistics
>>>South San Francisco, CA
>>>
>>>"The business of the statistician is to catalyze the
>>
>>scientific learning
>>
>>>process." - George E. P. Box
>>>
>>>
>>>
>>>
>>>>-----Original Message-----
>>>>From: r-help-bounces at stat.math.ethz.ch
>>>>[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dimitri Joe
>>>>Sent: Friday, March 03, 2006 11:28 AM
>>>>To: R-Help
>>>>Subject: [R] memory once again
>>>>
>>>>Dear all,
>>>>
>>>>A few weeks ago, I asked this list why small Stata files became huge R
>>>>files. Thomas Lumley said it was because "Stata uses single-precision
>>>>floating point by default and can use 1-byte and 2-byte integers. R uses
>>>>double precision floating point and four-byte integers." And it seemed I
>>>>couldn't do anything about it.
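
(The size difference is easy to see from R's own storage types; this only
illustrates 8-byte doubles versus 4-byte integers, not the actual Stata data:)

    x <- numeric(1e6)    # double precision: 8 bytes per element
    y <- integer(1e6)    # R integer: 4 bytes per element
    object.size(x)       # about 8 MB
    object.size(y)       # about 4 MB
    ## Stata's 1- and 2-byte integer types have no R equivalent, so the same
    ## column can take four to eight times as much space once it is in R.
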
>>>>
>>>>Is it true? I mean, isn't there a (more or less simple) way to change
>>>>how R stores data (maybe by changing the source code and compiling it)?
>>>>
>>>>The reason why I insist on this point is that I am trying to work with
>>>>a data frame with more than 820,000 observations and 80 variables.
>>>>The Stata file is 150 MB. With my Pentium IV 2 GHz and 1 GB RAM, Windows
>>>>XP, I couldn't do the import using the read.dta() function from package
>>>>foreign. With Stat Transfer I managed to convert the Stata file to an S
>>>>file of 350 MB, but my machine still didn't manage to import it using
>>>>read.S().
>>>>
>>>>I even tried to "increase" my memory by memory.limit(4000), but it
>>>>still didn't work.
>>>>
>>>>Regardless of the answer to my question, I'd appreciate hearing about
>>>>your experience/suggestions in working with big files in R.
>>>>
>>>>Thank you for youR-Help,
>>>>
>>>>Dimitri Szerman
>>>>