[R] memory problem in handling large dataset

Weiwei Shi helprhelp at gmail.com
Thu Oct 27 19:24:56 CEST 2005


Hi Jim,
Thanks for the calculation. I hope you don't mind that I'm cc'ing this
reply to R-help as well, so that I can get more input.

I assume you used 4 bytes per integer and 8 bytes per float, so
300*8 + 50*4 = 2600 bytes for each observation, right?
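
Just to make the arithmetic explicit (assuming plain 8-byte doubles and
4-byte integers, and ignoring R's per-object overhead, so the real
footprint would be somewhat larger):

  n.float <- 300
  n.int   <- 50
  bytes.per.obs <- n.float * 8 + n.int * 4   # 2600 bytes per observation
  n.obs <- 1.7e9
  bytes.per.obs * n.obs                      # ~4.4e12 bytes, i.e. about 4.4 TB
  (bytes.per.obs * n.obs) / 8e9              # ~550 times an 8 GB machine
  0.25 * 8e9 / bytes.per.obs                 # ~770,000 obs, near your 700,000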

I wish I could have 500x8 GB of memory :) just kidding. Definitely,
sampling will be the first step, and some feature selection (mainly
filtering) will be applied. Following Berton's suggestion, I will
probably use Python to do the sampling, since whenever I run into
"slow" situations like this, Python has never failed me. (I am not
saying R is bad, though.)
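
That said, something along these lines is roughly what I would try if I
did the subsampling purely in R, reading the file in chunks so the full
data never sits in memory. It is only an untested sketch: the chunk
size, the tab separator, and the assumption that there is no header row
are all guesses.

  sample.rows <- function(path, frac = 0.0005, chunk = 100000, sep = "\t") {
    con  <- file(path, open = "r")
    on.exit(close(con))
    keep <- character(0)
    repeat {
      lines <- readLines(con, n = chunk)
      if (length(lines) == 0) break
      ## keep each line independently with probability `frac`
      keep <- c(keep, lines[runif(length(lines)) < frac])
    }
    read.table(textConnection(keep), header = FALSE, sep = sep)
  }

  ## e.g. a ~0.05% sample from a (hypothetical) big.txt:
  ## x <- sample.rows("big.txt", frac = 0.0005)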

I understand "I get what I pay for" here. But any more information or
experience on handling large datasets in R (for example, using RMySQL)
would be appreciated.
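
For the RMySQL route, this is roughly what I have in mind; the
connection details and table name are made up, and a WHERE RAND() < p
filter still scans the whole table, so it would not be fast on 1.7
billion rows:

  library(DBI)
  library(RMySQL)

  ## hypothetical connection details and table name
  con  <- dbConnect(MySQL(), dbname = "mydb", user = "me", host = "localhost")

  ## pull roughly a 0.05% random sample of the rows
  samp <- dbGetQuery(con, "SELECT * FROM big_table WHERE RAND() < 0.0005")
  dbDisconnect(con)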

regards,

Weiwei

On 10/27/05, jim holtman <jholtman at gmail.com> wrote:
> Based on the numbers that you gave, if you wanted all the data in memory at
> once, you would need about 4.4 TB of memory, roughly 500X what you currently
> have.  Each of your observations will require about 2,600 bytes of memory.
> You probably don't want to use more than 25% of memory for a single object,
> since many of the algorithms make copies.  This would limit you to about
> 700,000 observations at a time for processing.
>
> The real question is what you are trying to do with the data.  Can you
> partition the data and do the analysis on subsets?
>
>
> On 10/27/05, Weiwei Shi <helprhelp at gmail.com> wrote:
> >
> > Dear Listers:
> > I have a question on handling a large dataset. I searched the R site
> > search, and I hope to get more information specific to my case.
> >
> > First, my dataset has 1.7 billion observations and 350 variables, of
> > which 300 are floats and 50 are integers.
> > My system is a Linux box with 8 GB of memory and a 64-bit CPU.
> > (Currently, we don't plan to buy more memory.)
> >
> > > R.version
> >         _
> > platform i686-redhat-linux-gnu
> > arch     i686
> > os       linux-gnu
> > system   i686, linux-gnu
> > status
> > major    2
> > minor    1.1
> > year     2005
> > month    06
> > day      20
> > language R
> >
> >
> > If I want to do some analysis, for example randomForest, on the
> > dataset, what is the maximum number of observations I can load and
> > still have the machine run smoothly?
> >
> > After figuring out that number, I want to do some sampling first, but
> > I did not find that read.table or scan can do this. I guess I can load
> > the data into MySQL and then use RMySQL to do the sampling, or use
> > Python to subset the data first. My question is: is there a way I can
> > subsample directly from the file using only R?
> >
> > Thanks,
> > --
> > Weiwei Shi, Ph.D
> >
> > "Did you always know?"
> > "No, I did not. But I believed..."
> > ---Matrix III
> >
> > ______________________________________________
> > R-help at stat.math.ethz.ch mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide!
> > http://www.R-project.org/posting-guide.html
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 247 0281
>
> What is the problem you are trying to solve?


--
Weiwei Shi, Ph.D

"Did you always know?"
"No, I did not. But I believed..."
---Matrix III



