[R] memory problem in handling large dataset

Berton Gunter gunter.berton at gene.com
Thu Oct 27 18:49:33 CEST 2005


I think the general advice is that around 1/4 to 1/3 of your available
memory is about the largest data set that R can handle -- and often
considerably less, depending upon what you do and how you do it (because R's
pass-by-value semantics mean objects are often copied rather than passed by
reference). Fancy tricks using environments might enable you to do better,
but that requires advice from a true guru, which I ain't.
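
As a rough illustration of that rule of thumb applied to the numbers in the
original question (300 floating-point and 50 integer columns, 8 GB of RAM),
here is a back-of-the-envelope sketch. It assumes floats are stored as 8-byte
doubles and integers as 4 bytes, ignores per-object overhead and any copies
made during analysis, and assumes R can actually address the full 8 GB. The
variable names are made up for illustration.

```r
# Rough memory budget, using the column counts from the original question.
n_double <- 300                 # floating-point columns, 8 bytes each (assumed doubles)
n_int    <- 50                  # integer columns, 4 bytes each
bytes_per_row <- n_double * 8 + n_int * 4   # 2600 bytes per observation

ram_bytes <- 8 * 1024^3         # 8 GB of physical memory
budget    <- ram_bytes / 4      # conservative end of the 1/4 to 1/3 rule

max_rows <- floor(budget / bytes_per_row)
max_rows                        # 825955
```

Under those assumptions only about 826,000 complete rows fit in a quarter of
memory, and the comfortable number for a real analysis may well be smaller.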

See ?connections, ?scan, and ?seek for reading a file one chunk at a time
from a connection, which lets you sample, say, one line of data from each
chunk.
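
A minimal sketch of that idea, assuming a plain text file with one
observation per line and no header row. The function name
sample_one_per_chunk and the default chunk size are hypothetical, and it uses
readLines() on an open connection rather than scan(), purely for brevity.

```r
# Read a file chunk by chunk from an open connection and keep one
# randomly chosen line from each chunk. Only one chunk is in memory at a time.
sample_one_per_chunk <- function(path, chunk_size = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))           # always release the connection
  picked <- character(0)
  i <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size)   # continues where the last read stopped
    if (length(chunk) == 0L) break            # end of file
    i <- i + 1L
    picked[i] <- chunk[sample.int(length(chunk), 1L)]
  }
  picked
}

# Small demonstration on a temporary 10-line file: two chunks, two sampled lines.
tmp <- tempfile()
writeLines(as.character(1:10), tmp)
res <- sample_one_per_chunk(tmp, chunk_size = 5L)
length(res)
```

The sampled lines can then be parsed with something like
read.table(text = res). Because the connection stays open between reads, each
line of the file is read only once.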

I suppose you could do this directly with repeated calls to scan() or
read.table() by skipping more and more lines at the beginning at each call,
but I assume that is horridly inefficient and would take forever.
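
For comparison, the skip-based approach would look something like the sketch
below (read_block is a hypothetical helper, and the file is assumed to have
no header). Each call reopens the file and has to read past all the skipped
lines again, so the total work grows roughly with the square of the number of
blocks.

```r
# Read the k-th block of chunk_size rows by skipping everything before it.
# Every call starts again from the top of the file.
read_block <- function(path, k, chunk_size, ...) {
  read.table(path, skip = (k - 1L) * chunk_size, nrows = chunk_size, ...)
}

# Small demonstration: a 10-row, 2-column file, reading the second block of 5 rows.
tmp <- tempfile()
writeLines(paste(1:10, 11:20), tmp)
b <- read_block(tmp, k = 2L, chunk_size = 5L)
nrow(b)
```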

HTH.

-- Bert Gunter
Genentech Non-Clinical Statistics
South San Francisco, CA
 
"The business of the statistician is to catalyze the scientific learning
process."  - George E. P. Box
 
 

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
> Sent: Thursday, October 27, 2005 9:28 AM
> To: r-help
> Subject: [R] memory problem in handling large dataset
> 
> Dear Listers:
> I have a question on handling large dataset. I searched R-Search and I
> hope I can get more information as to my specific case.
> 
> First, my dataset has 1.7 billion observations and 350 variables,
> among which, 300 are float and 50 are integers.
> My system has 8 G memory, 64bit CPU, linux box. (currently, we don't
> plan to buy more memory).
> 
> > R.version
>          _
> platform i686-redhat-linux-gnu
> arch     i686
> os       linux-gnu
> system   i686, linux-gnu
> status
> major    2
> minor    1.1
> year     2005
> month    06
> day      20
> language R
> 
> 
> If I want to do some analysis for example like randomForest on a
> dataset, how many max observations can I load to get the machine run
> smoothly?
> 
> After figuring out that number, I want to do some sampling first, but
> I did not find read.table or scan can do this. I guess I can load it
> into mysql and then use RMySQL do the sampling or use python to subset
> the data first. My question is, is there a way I can subsample
> directly from file just using R?
> 
> Thanks,
> --
> Weiwei Shi, Ph.D
> 
> "Did you always know?"
> "No, I did not. But I believed..."
> ---Matrix III
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! 
> http://www.R-project.org/posting-guide.html
>



