[R] Large data files

Thomas Lumley thomas at biostat.washington.edu
Wed Dec 29 21:33:10 CET 1999

On Wed, 29 Dec 1999 cstrato at EUnet.at wrote:

> Dear R and S-Plus users:
> Currently I am using:
> at work: "S-Plus 2000 Pro" on a PC: Pentium II/350MHz, 256 MB RAM,
> running Win NT
> at home: "R" on my Mac PowerBook G3/292MHz, 128 MB RAM, running LinuxPPC
> Currently, at home I am trying to import a table (nrow=302500, ncol=6),
> which I have to do one column at a time because of memory problems. Some
> of the columns I use directly; others I have to convert into matrices
> (550 x 550) for doing calculations.
> Ultimately, I have to import many (ca. 20-100) of these tables, which
> will be impossible on my current machines due to memory limitations.
> My question now is the following:
> At work I have access to the following multiprocessor machines:
> a, Compaq Proliant Server: 4 x Pentium II/450MHz, 2 GB RAM, Win NT
> b, Sun Enterprise 450 Server: 4 x SPARC/??MHz, 2 GB RAM, Solaris 2.6
> For testing purposes I would like to install "R":
> 1, Can R take advantage of multiprocessor machines?

Not really. You can run multiple copies of R, which lets you get four
things done at once, but a single R process is not multithreaded, so one
computation cannot use more than one processor.
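If the work splits into independent jobs, the usual trick is to launch one
R batch job per processor from the shell. A sketch (the script names
job1.R-job4.R are hypothetical; the echo stands in for the real R
invocation shown in the comment):

```shell
# Run four independent R batch jobs in parallel, one per processor,
# then wait for all of them to finish.
for i in 1 2 3 4; do
  # In practice this line would be:  R BATCH job$i.R job$i.Rout &
  echo "would run: R BATCH job$i.R job$i.Rout" &
done
wait   # block until every background job has finished
```

Each job is a separate process, so the operating system can schedule them
on different CPUs without R itself knowing anything about threads.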

> 2, Which machine would be better suited to run R on?

Either would work. We have done some very limited speed comparisons on
machines here: the various test suites for the survival5 package run at
about the same speed on a new Sun Enterprise server and on a Pentium
II/400 under Linux; they run faster on a Pentium III/500 under WinNT and
slower on an eighteen-month-old Sun Enterprise 450 server.

The speeds are close enough that other factors are probably more important
(which system you prefer, and how many other people you will annoy by
taking over the machine).

If you are doing a lot of simple linear algebra the Sun Workshop compilers
might be expected to have some advantages over gcc: I haven't found any
examples where it matters, but I don't work with very large matrices much.
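If you do want to try the Sun compilers, R's configure script can be
pointed at them through environment variables. A hypothetical sketch (the
-fast optimization flag is the Sun Workshop one; check your compiler
documentation before relying on it):

```shell
# Build R from source with the Sun Workshop C and Fortran compilers
# instead of gcc/g77, by overriding configure's compiler variables.
CC=cc CFLAGS=-fast F77=f77 FFLAGS=-fast ./configure
make
```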

> Finally, the question is:
> Is R or S-Plus better suited for handling such large data?
> Would "S-Plus 2000" for Win NT or "S-Plus 5" for Unix be better suited?
> Can S-Plus take advantage of multiprocessor machines?

Neither R nor S-PLUS is particularly suited to handling large data.  I
believe S-PLUS has some multithreading, but that its main computations are
still done by a single processor. However, this is perhaps not the best
list to get information about S-PLUS.

You would be better off splitting the data into pieces using some other
program.  Either S-PLUS or R will handle 550x550 matrices perfectly
happily if you have that much memory.
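For splitting, the standard Unix tools are enough: pull out one column at
a time and let R read just that. A sketch, assuming a whitespace-separated
table in a hypothetical file big.dat:

```shell
# Extract column 3 of a whitespace-separated table into its own file,
# so R only has to read one column (302500 values) at a time.
awk '{print $3}' big.dat > col3.dat
# Inside R one would then do something like:
#   x <- scan("col3.dat")
#   m <- matrix(x[1:302500], nrow = 550, ncol = 550)
```

Reading a single pre-split column with scan() needs far less memory than
read.table() on the full six-column table.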

Thomas Lumley
Assistant Professor, Biostatistics
University of Washington, Seattle

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
