[R] Enormous Datasets

Peter Dalgaard p.dalgaard at biostat.ku.dk
Thu Nov 18 22:07:23 CET 2004


Thomas W Volscho <THOMAS.VOLSCHO at huskymail.uconn.edu> writes:

> Dear List, I have some projects where I use enormous datasets. For
> instance, the 5% PUMS microdata from the Census Bureau. After
> deleting cases I may have a dataset with 7 million+ rows and 50+
> columns. Will R handle a datafile of this size? If so, how?

With a big machine... If that is numeric, non-integer data, you are
looking at something like 

> 7e6*50*8
[1] 2.8e+09

i.e. roughly 3 GB of data for one copy of the data set. You can easily
find yourself with multiple copies in memory, so I suppose a machine
with 16 GB of RAM would cut it. These days that basically suggests an
x86_64 architecture running Linux (e.g. multiprocessor Opterons), but
there are also 64-bit Unix "big iron" solutions (Sun, IBM, HP, ...).
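As a rough sanity check of that figure, you can build a modest numeric
data frame and scale its footprint up with object.size() (a sketch in
base R; the 10,000-row sample size is just for illustration):

d <- as.data.frame(matrix(rnorm(10000 * 50), ncol = 50))
bytes.per.row <- as.numeric(object.size(d)) / 10000
bytes.per.row * 7e6 / 2^30   # roughly 2.6-3 GB for one in-memory copy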

If you can avoid dealing with the whole dataset at once, smaller
machines might get you there. Notice that a single column is "only"
56 MB, and you may be able to work with aggregated data from some step
onwards.
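For instance, one way to do that (a sketch using only base R; the file
name "pums.dat", the delimiter, and the column positions are made up)
is to read just the columns you need via colClasses = "NULL", and to
process the file in chunks over a connection, aggregating as you go
rather than holding all 7 million rows in memory:

## keep only a few columns; anything marked "NULL" is skipped on input
cc <- rep("NULL", 50)
cc[c(1, 7, 12)] <- "numeric"

con <- file("pums.dat", open = "r")   # "pums.dat" is a hypothetical file
total <- 0; n <- 0
repeat {
    chunk <- tryCatch(read.table(con, colClasses = cc, nrows = 1e5),
                      error = function(e) NULL)   # NULL at end of file
    if (is.null(chunk) || nrow(chunk) == 0) break
    total <- total + sum(chunk[[1]])   # running sum of first kept column
    n <- n + nrow(chunk)
}
close(con)
total / n   # a mean computed without ever loading the full data set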


-- 
   O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907



