[Rd] R as analysis server for very large data sets
George Ostrouchov
ostrouchovg@ornl.gov
Wed Feb 19 02:07:02 2003

At ORNL, we are building a system, ASPECT (Adaptive Simulation Product
Exploration and Control Toolkit), for analyzing output from massive
simulations. It is essentially a client-server setup that reads
netCDF and HDF files and uses MPI for some distributed tasks. The total
output of a simulation can be terabytes, but an individual variable may be
only a gigabyte, and some relevant subsets are smaller still. In theory, a
single variable can be handled on a 64-bit machine with a few gigabytes
of memory, say 10 GB. I understand that some folks have had some success
running R on a 64-bit machine.

In addition to some home-grown distributed data analysis codes, we have
included a facility for calling a limited subset of R functions from
ASPECT. Simple use of R on a large data set did not work well. For
example, computing a simple histogram consumed several times (I think it
was 3 times) more memory than the data itself required. Some
editing of the hist.default function fixed the problem, but reduced the
generality of the function. The default seemed to generate a dimnames
attribute that became as large as the data. It may be that our initial
data matrix had some attributes we were not aware of.
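
One way to check whether a matrix carries attributes one is not aware of,
and how much memory they account for, is sketched below. The data are
made up for illustration (a one-column matrix with hypothetical row names
such as "cell1"); they are not ASPECT output, and the names n, x and
x_named are assumptions of this sketch.

```r
# Hypothetical stand-in for one simulation variable: a one-column matrix.
n <- 100000L
x <- matrix(rnorm(n), ncol = 1)

# Attach row names, as can happen when data are imported or subset.
x_named <- x
rownames(x_named) <- paste0("cell", seq_len(n))

# List the attributes each object actually carries.
names(attributes(x))        # "dim"
names(attributes(x_named))  # "dim" "dimnames"

# Compare memory footprints; object.size() counts the dimnames strings too.
size_plain <- as.numeric(object.size(x))
size_named <- as.numeric(object.size(x_named))
size_named / size_plain     # ratio of named to plain size
```

For a tall, narrow matrix like this one, the character row names can take
more memory than the numeric data they label, which would be consistent
with the dimnames behaviour described above.
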
It seems that generality and metadata generation in R run counter to R's
ability to handle large data sets. Can someone comment on this?

Are there functions in R that will strip a variable of all its
attributes except those that define its structure as a vector, matrix, or
array? Or are there options to prevent some functions from generating
more attributes? ... Perhaps an attribute to prevent further attributes?
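
For reference, a minimal sketch of such a stripping step in plain R
follows. The helper name strip_attributes is hypothetical, not an existing
R function; it relies only on the base functions dim() and attributes().
Note that modifying the object inside a function may cause R to copy the
data once, which matters at this scale.

```r
# Hypothetical helper: drop every attribute except the dimensions.
strip_attributes <- function(x) {
  d <- dim(x)             # NULL for a plain vector
  attributes(x) <- NULL   # removes names, dimnames, class, and the rest
  dim(x) <- d             # restores matrix/array shape; no-op when d is NULL
  x
}

# Small example with dimnames and an extra, made-up attribute.
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("a", "b"), c("x", "y", "z")))
attr(m, "source") <- "hypothetical"

s <- strip_attributes(m)
names(attributes(s))      # "dim"
```

If only the names are unwanted, the base functions unname(x) or
dimnames(x) <- NULL remove names and dimnames and leave other attributes
alone.
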
Does it make sense to propose building (assuming that someone has time
to do it) a "large data" subset of R?

Thanks for your help,
George
----------------------------------------------------------
George Ostrouchov
Statistics and Data Sciences Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
----------------------------------------------------------