[R] memory problem in handling large dataset
Weiwei Shi
helprhelp at gmail.com
Thu Oct 27 19:27:25 CEST 2005
Dear Andy:
I think our emails crossed. But thanks as before.
Weiwei
On 10/27/05, Liaw, Andy <andy_liaw at merck.com> wrote:
> If my calculation is correct (very doubtful, sometimes), that's
>
> > 1.7e9 * (300 * 8 + 50 * 4) / 1024^3
> [1] 4116.446
>
> or over 4 terabytes, just to store the data in memory.
>
> To sample rows and read that into R, Bert's suggestion of using connections,
> perhaps along with seek() for skipping ahead, would be what I'd try. I had
> tried to do such things in Python as a chance to learn that language, but I
> found operationally it's easier to maintain the project by doing everything
> in one language, namely R, if possible.
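
A minimal sketch of the seek() idea, assuming a plain-text file with one
record per line and at least two lines. The function name
sample_lines_by_seek and the demo file are hypothetical, not something
from the thread. It jumps to a random byte offset, throws away the line it
landed in, and keeps the next complete line. Note the caveats: a line is
picked with probability proportional to the length of the line before it
(so this is only roughly uniform when rows have similar lengths), the same
line can be drawn twice, and the first line (for example a header) is never
returned.

```r
# Sketch only: sample n lines by seeking to random byte offsets.
sample_lines_by_seek <- function(path, n) {
  size <- file.info(path)$size
  con <- file(path, open = "rb")        # binary mode so byte offsets are exact
  on.exit(close(con))
  out <- character(0)
  while (length(out) < n) {
    seek(con, where = floor(runif(1, 0, size)), origin = "start")
    readLines(con, n = 1L)              # discard the (probably partial) line
    line <- readLines(con, n = 1L)      # next complete line, if any
    if (length(line) == 1L) out <- c(out, line)  # empty when we hit the last line
  }
  out
}

# Hypothetical demo file: a header plus 1000 rows.
seek_demo <- tempfile(fileext = ".txt")
writeLines(c("id", paste0("row", 1:1000)), seek_demo)
set.seed(42)
seek_sample <- sample_lines_by_seek(seek_demo, 20)
```

The sampled lines can then be parsed with read.table(text = seek_sample).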
>
> Andy
>
>
> > From: Berton Gunter
> >
> > I think the general advice is that around 1/4 or 1/3 of your available
> > memory is about the largest data set that R can handle -- and often
> > considerably less depending upon what you do and how you do it (because
> > R's semantics require explicitly copying objects rather than passing
> > pointers). Fancy tricks using environments might enable you to do better,
> > but that requires advice from a true guru, which I ain't.
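
As a back-of-the-envelope illustration of the 1/4 to 1/3 rule of thumb, the
sketch below turns it into a row count for the dataset described in the
original question (300 doubles and 50 integers per row, 8 GB of RAM). The
variable names are made up for this example, and the result ignores the
extra copies an analysis such as randomForest would make, so treat it as an
upper bound at best. The R.version output quoted later in the thread also
reports an i686 (32-bit) build, which probably caps a single R process well
below 8 GB regardless.

```r
# Rough memory budget; all figures are approximations.
bytes_per_row <- 300 * 8 + 50 * 4               # 300 doubles + 50 integers
ram_bytes     <- 8 * 1024^3                     # 8 GB of physical memory
budget        <- ram_bytes / c(quarter = 4, third = 3)
max_rows      <- floor(budget / bytes_per_row)  # rows that fit in each budget
sample_frac   <- max_rows / 1.7e9               # share of the 1.7 billion rows
print(max_rows)
print(signif(sample_frac, 2))
```

Under these assumptions the budget works out to somewhere around a million
rows, i.e. well under a tenth of a percent of the full data.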
> >
> > See ?connections, ?scan, ?seek for reading in a file a chunk at a time
> > from a connection, thus enabling you to sample one line of data from
> > each chunk, say.
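
The chunk-at-a-time approach described above might look roughly like this.
It is a sketch, assuming a whitespace-delimited text file with a header
line; the function name sample_one_per_chunk, the chunk size, and the demo
file are all hypothetical. Because the connection stays open, each
readLines() call continues where the previous one stopped, so nothing is
re-read.

```r
# Sketch only: keep one randomly chosen line from every chunk of lines.
sample_one_per_chunk <- function(path, chunk_size = 10000L, header = TRUE) {
  con <- file(path, open = "r")
  on.exit(close(con))
  head_line <- if (header) readLines(con, n = 1L) else character(0)
  picked <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)   # next chunk from current position
    if (length(chunk) == 0L) break            # end of file
    picked <- c(picked, chunk[sample.int(length(chunk), 1L)])
  }
  # Parse only the sampled lines into a data frame.
  read.table(text = c(head_line, picked), header = header)
}

# Hypothetical demo: 1000 rows in chunks of 100 gives 10 sampled rows.
chunk_demo <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:1000, x = rnorm(1000)), chunk_demo,
            row.names = FALSE, quote = FALSE)
set.seed(1)
chunk_sample <- sample_one_per_chunk(chunk_demo, chunk_size = 100L)
```

The sampling fraction is 1 / chunk_size, so the chunk size can be chosen from
the number of rows that fit in memory.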
> >
> > I suppose you could do this directly with repeated calls to scan() or
> > read.table() by skipping more and more lines at the beginning at each
> > call, but I assume that is horridly inefficient and would take forever.
> >
> > HTH.
> >
> > -- Bert Gunter
> > Genentech Non-Clinical Statistics
> > South San Francisco, CA
> >
> > "The business of the statistician is to catalyze the
> > scientific learning
> > process." - George E. P. Box
> >
> >
> >
> > > -----Original Message-----
> > > From: r-help-bounces at stat.math.ethz.ch
> > > [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Weiwei Shi
> > > Sent: Thursday, October 27, 2005 9:28 AM
> > > To: r-help
> > > Subject: [R] memory problem in handling large dataset
> > >
> > > Dear Listers:
> > > I have a question on handling a large dataset. I searched R-Search and
> > > I hope I can get more information as to my specific case.
> > >
> > > First, my dataset has 1.7 billion observations and 350 variables,
> > > of which 300 are floats and 50 are integers.
> > > My system is a Linux box with 8 GB of memory and a 64-bit CPU
> > > (currently, we don't plan to buy more memory).
> > >
> > > > R.version
> > >          _
> > > platform i686-redhat-linux-gnu
> > > arch     i686
> > > os       linux-gnu
> > > system   i686, linux-gnu
> > > status
> > > major    2
> > > minor    1.1
> > > year     2005
> > > month    06
> > > day      20
> > > language R
> > >
> > >
> > > If I want to do some analysis, for example randomForest, on a
> > > dataset, what is the maximum number of observations I can load while
> > > keeping the machine running smoothly?
> > >
> > > After figuring out that number, I want to do some sampling first, but
> > > I did not find a way to do this with read.table or scan. I guess I can
> > > load the data into MySQL and then use RMySQL to do the sampling, or use
> > > Python to subset the data first. My question is, is there a way I can
> > > subsample directly from the file just using R?
> > >
> > > Thanks,
> > > --
> > > Weiwei Shi, Ph.D
> > >
> > > "Did you always know?"
> > > "No, I did not. But I believed..."
> > > ---Matrix III
> > >
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide!
> > > http://www.R-project.org/posting-guide.html
> > >
> >
> >
> >
>
>