[R] memory problem in handling large dataset
Weiwei Shi
helprhelp at gmail.com
Thu Oct 27 18:27:46 CEST 2005
Dear Listers:
I have a question about handling a large dataset. I searched R-Search, and I hope I can get more information specific to my case.
First, my dataset has 1.7 billion observations and 350 variables, of which 300 are floats and 50 are integers.
My system is a Linux box with 8 GB of memory and a 64-bit CPU (currently, we don't plan to buy more memory).
> R.version
         _
platform i686-redhat-linux-gnu
arch     i686
os       linux-gnu
system   i686, linux-gnu
status
major    2
minor    1.1
year     2005
month    06
day      20
language R
If I want to run an analysis such as randomForest on this dataset, what is the maximum number of observations I can load while still having the machine run smoothly?
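My own back-of-the-envelope estimate (just an assumption: 8 bytes per double, 4 bytes per integer, and no allowance for the extra copies R makes) is:

    ## rough size estimate, assuming 8-byte doubles and 4-byte integers
    ## and ignoring the copies R tends to make during analysis
    bytes.per.row <- 300 * 8 + 50 * 4    # 2600 bytes per observation
    1.7e9 * bytes.per.row / 2^30         # ~4100 GB for the full table
    8 * 2^30 / bytes.per.row             # ~3.3 million rows in 8 GB, at best

so clearly only a small fraction of the data can ever be in memory at once.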
After figuring out that number, I want to do some sampling first, but I did not find that read.table or scan can do this. I guess I could load the data into MySQL and then use RMySQL to do the sampling, or use Python to subset the data first. My question is: is there a way to subsample directly from the file using only R, along the lines of the sketch below?
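Something like the following untested sketch is what I have in mind; the file name, separator, and sampling rate are only placeholders for my real data:

    ## untested sketch: stream the file through a connection and keep
    ## each row with a small probability, so the full table never has
    ## to sit in memory at once
    subsample.file <- function(infile, outfile, p = 1e-4, chunk = 100000) {
        con <- file(infile, open = "r")
        out <- file(outfile, open = "w")
        on.exit({close(con); close(out)})
        writeLines(readLines(con, n = 1), out)   # copy the header line
        repeat {
            lines <- readLines(con, n = chunk)   # next block of rows
            if (length(lines) == 0) break        # end of file
            keep <- runif(length(lines)) < p     # Bernoulli row sample
            if (any(keep)) writeLines(lines[keep], out)
        }
        invisible(outfile)
    }

    ## the resulting sample should then be small enough for read.table:
    ## d <- read.table("sample.csv", header = TRUE, sep = ",")

Is there a better or more standard way to do this in R?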
Thanks,
--
Weiwei Shi, Ph.D
"Did you always know?"
"No, I did not. But I believed..."
---Matrix III