[R] Computations slow in spite of large amounts of RAM.
Liaw, Andy
andy_liaw at merck.com
Tue Jul 1 16:37:24 CEST 2003
> From: Huiqin Yang [mailto:Huiqin.Yang at noaa.gov]
>
> Hi all,
>
> I am a beginner trying to use R to work with large amounts of
> oceanographic data, and I find that computations can be VERY
> slow. In particular, computational speed seems to depend
> strongly on the number and size of the objects that are
> loaded (when R starts up). The same computations are
> significantly faster when all but the essential objects are
> removed. I am running R on a machine with 16 GB of RAM, and
> our unix system manager assures me that there is memory
> available to my R process that has not been used.
>
> 1. Is the problem associated with how R uses memory? If so,
> is there some way to increase the amount of memory used by my
> R process to get better performance?
Is R compiled as 64-bit? If not, it won't be able to use more than 4GB of
RAM (that's my understanding, anyway).
R keeps objects in memory, so if you are working with large amounts of data,
it's a good habit to keep only the absolutely essential objects in the
workspace, and to save() and rm() things you don't need for the computation.
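
A minimal sketch of that habit, assuming two made-up objects (big.raw and p.small) and a temporary file location chosen only for illustration. The last line reports the pointer size, which is one way to tell whether the running R is a 64-bit build:

```r
# Hypothetical objects standing in for a real workspace.
big.raw <- rnorm(1e6)   # large intermediate, not needed for the next step
p.small <- rnorm(10)    # small object the computation actually uses

# List the objects in the workspace by size in bytes, largest first.
sizes <- sapply(ls(), function(nm) as.numeric(object.size(get(nm))))
print(sort(sizes, decreasing = TRUE))

# Write the large object to disk, then drop it from the workspace.
f <- file.path(tempdir(), "big.raw.RData")   # hypothetical file location
save(big.raw, file = f)
rm(big.raw)
invisible(gc())   # optional: R collects garbage on its own, this just does it now

# ... run the computation that only needs p.small ...

# Restore the saved object when it is needed again.
load(f)

# Pointer size in bytes: 8 indicates a 64-bit build, 4 a 32-bit one.
.Machine$sizeof.pointer
```

Whether this helps in a given case depends on how much of the slowdown really comes from memory pressure, so it is worth timing the computation before and after.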
>
> The computations that are particularly slow involve looping
> with by(). The data are measurements of vertical profiles of
> pressure, temperature, and salinity at a number of stations,
> which are organized into a dataframe p.1 (1925930 rows, 8
> columns: id, p, t, s, etc.). The objective is to get a much
> smaller dataframe (one row for each of the 1409 unique values
> of id) with the minimum and maximum pressure for each
> profile. The slow part is:
>
> h.maxmin <- by(p.1,p.1$id,function(x){
> data.frame(id=x$id[1],
> maxp=max(x$p),
> minp=min(x$p))})
>
> 2. Even with unneeded data objects removed, this is very
> slow. Is there a faster way to get the maximum and minimum values?
Why do you need to use by(), and why have the function return a data frame
containing only one row? Here's an experiment on my 900MHz PIII laptop:
> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace=TRUE)),
+ p = rnorm(n))
>
>
> system.time(h.maxmin <- by(dat, dat$id,function(x) {
+ data.frame(id=x$id[1], maxp=max(x$p), minp=min(x$p))}))
[1] 2.75 0.01 2.78 NA NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13 NA NA
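If you want to coerce the result to a data frame with id as row names and
min and max as the two variables (in that order, since range() returns the
minimum first; the columns get the default names V1 and V2), you can do: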
junk.dat <- as.data.frame(do.call("rbind", junk))
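
The same steps can be sketched end to end with explicit column names. This is only an illustration on a small synthetic data set; the names minp and maxp are borrowed from the original by() call, and moving id into a regular column is an optional extra, not something the real p.1 data requires:

```r
# Small synthetic stand-in for the real data (made-up values).
set.seed(1)
n <- 1000
dat <- data.frame(id = sort(sample(LETTERS, n, replace = TRUE)),
                  p  = rnorm(n))

# One (min, max) pair per id; tapply() returns a list-like array named by id.
junk <- tapply(dat$p, dat$id, range)

# Stack the pairs into a matrix, then convert to a data frame.
junk.dat <- as.data.frame(do.call("rbind", junk))
names(junk.dat) <- c("minp", "maxp")   # range() gives the minimum first

# Optionally carry id as a regular column instead of as row names.
junk.dat <- data.frame(id = rownames(junk.dat), junk.dat, row.names = NULL)
head(junk.dat)
```

Passing range directly, rather than wrapping it in function(x) range(x), does the same thing with a little less typing.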
HTH,
Andy
> platform sparc-sun-solaris2.9
> arch sparc
> os solaris2.9
> system sparc, solaris2.9
> status
> major 1
> minor 7.0
> year 2003
> month 04
> day 16
> language R
>
> Thank you for your time.
>
> Helen
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
>