[R] memory once again
Duncan Murdoch
murdoch at stats.uwo.ca
Fri Mar 3 21:31:14 CET 2006
On 3/3/2006 2:42 PM, Berton Gunter wrote:
> What you propose is not really a solution, as even if your data set didn't
> break the modified precision, another would. And of course, there is a price
> to be paid for reduced numerical precision.
>
> The real issue is that R's current design is incapable of dealing with data
> sets larger than what can fit in physical memory (expert
> comment/correction?).
It can deal with big data sets, just not nearly as conveniently as it
deals with ones that fit in memory. The most straightforward way is
probably to put them in a database, and use RODBC or one of the
database-specific packages to read the data in blocks. (You could also
leave the data in a flat file and read it a block at a time from there,
but the database is probably worth the trouble: other people have done
the work involved in sorting, selecting, etc.)
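For concreteness, here is a rough sketch of that block-at-a-time pattern
using RODBC. The DSN "mydata", the table "survey" and the block size are
all made up for illustration; substitute whatever your own database uses.

library(RODBC)

## open a connection to a (hypothetical) ODBC data source named "mydata"
con <- odbcConnect("mydata")

## submit the query, but fetch only the first 10000 rows of the result set
block <- sqlQuery(con, "SELECT * FROM survey", max = 10000)

while (is.data.frame(block) && nrow(block) > 0) {
    ## ... process 'block' here ...

    ## then pull the next 10000 rows of the same pending result set
    block <- sqlGetResults(con, max = 10000)
}

odbcClose(con)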
The main problem you'll run into is that almost none of the R functions
know about databases, so you'll end up doing a lot of work to rewrite
the algorithms to work one block at a time, or on a random sample of
data, or whatever.
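To give an idea of the rewriting involved: a mean, say, becomes a running
sum and count updated once per block. The sketch below computes a one-pass
mean of an invented column "income" over the same kind of connection as
above; the table and column names are placeholders.

con <- odbcConnect("mydata")

total <- 0   # running sum of non-missing values
n     <- 0   # running count of non-missing values

block <- sqlQuery(con, "SELECT income FROM survey", max = 50000)
while (is.data.frame(block) && nrow(block) > 0) {
    x     <- block$income
    total <- total + sum(x, na.rm = TRUE)
    n     <- n + sum(!is.na(x))
    block <- sqlGetResults(con, max = 50000)
}
odbcClose(con)

total / n    # the overall mean, computed without holding all rows at once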
The original poster didn't say what he wanted to do with his data, but
if he only needs to work with a few variables at a time, he can easily
fit an 820,000 x N data frame in memory, for small values of N. Reading
it in from a database would be easy.
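For scale: a double column of 820,000 values takes about 820,000 * 8
bytes, roughly 6.5 MB, so even a dozen such columns is well under 100 MB.
A minimal sketch, again with invented DSN, table and variable names:

con <- odbcConnect("mydata")
## pull only the handful of variables actually needed for the analysis
dat <- sqlQuery(con, "SELECT id, income, region FROM survey")
odbcClose(con)

dim(dat)            # about 820,000 rows by 3 columns
object.size(dat)    # total size in bytes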
Duncan Murdoch
> My understanding is that there is no way to change
> this without a fundamental redesign of R. This means that you must either
> live with R's limitations or use other software for "large" data sets.
>
> -- Bert Gunter
> Genentech Non-Clinical Statistics
> South San Francisco, CA
>
> "The business of the statistician is to catalyze the scientific learning
> process." - George E. P. Box
>
>
>
>> -----Original Message-----
>> From: r-help-bounces at stat.math.ethz.ch
>> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Dimitri Joe
>> Sent: Friday, March 03, 2006 11:28 AM
>> To: R-Help
>> Subject: [R] memory once again
>>
>> Dear all,
>>
>> A few weeks ago, I asked this list why small Stata files became huge R
>> files. Thomas Lumley said it was because "Stata uses single-precision
>> floating point by default and can use 1-byte and 2-byte integers. R uses
>> double precision floating point and four-byte integers." And it seemed I
>> couldn't do anything about it.
>>
>> Is it true? I mean, isn't there a (more or less simple) way to change how
>> R stores data (maybe by changing the source code and recompiling it)?
>>
>> The reason I insist on this point is that I am trying to work with a
>> data frame of more than 820,000 observations and 80 variables. The
>> Stata file is 150 MB. With my Pentium IV 2 GHz and 1 GB of RAM, under
>> Windows XP, I couldn't import it using the read.dta() function from the
>> foreign package. With Stat Transfer I managed to convert the Stata file
>> to an S file of 350 MB, but my machine still couldn't import it using
>> read.S().
>>
>> I even tried to "increase" my memory with memory.limit(4000), but it
>> still didn't work.
>>
>> Regardless of the answer to my question, I'd appreciate hearing about
>> your experiences and suggestions for working with big files in R.
>>
>> Thank you for youR-Help,
>>
>> Dimitri Szerman
>>