[R] Computations slow in spite of large amounts of RAM.

Tue Jul 1 16:37:24 CEST 2003

> From: Huiqin Yang [mailto:Huiqin.Yang at noaa.gov] 
> 
> Hi all,
> 
> I am a beginner trying to use R to work with large amounts of 
> oceanographic data, and I find that computations can be VERY 
> slow.  In particular, computational speed seems to depend 
> strongly on the number and size of the objects that are 
> loaded (when R starts up).  The same computations are 
> significantly faster when all but the essential objects are 
> removed.  I am running R on a machine with 16 GB of RAM, and 
> our unix system manager assures me that there is memory 
> available to my R process that has not been used.
> 
> 1.  Is the problem associated with how R uses memory?  If so, 
> is there some way to increase the amount of memory used by my 
> R process to get better performance?

Is R compiled as 64-bit?  If not, it won't be able to use more than 4GB of
RAM (that's my understanding, anyway).
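
One way to check this from inside R, as a minimal sketch: in current versions of R, .Machine$sizeof.pointer reports the pointer size in bytes, and 8 indicates a 64-bit build.  The variable names below are only illustrative, and I have not checked whether this field is available in a release as old as 1.7.0.

```r
# Pointer size in bytes for the running R build (8 means 64-bit, 4 means 32-bit)
ptr.bytes <- .Machine$sizeof.pointer
is.64bit <- ptr.bytes == 8
cat("pointer size:", ptr.bytes, "bytes; 64-bit build:", is.64bit, "\n")
```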

R keeps objects in memory, so if you are working with large amounts of data,
it's a good habit to keep only the absolutely essential objects in the
workspace, and to save() and rm() things you don't need for the computation.
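
A rough sketch of that habit, with made-up names: big.raw stands in for any large object you do not need right now, and the file name is hypothetical.

```r
# 'big.raw' is a stand-in for a large object not needed in the next step
big.raw <- data.frame(x = rnorm(1e5))

# Hypothetical file location; any writable path will do
f <- file.path(tempdir(), "big_raw.RData")

save(big.raw, file = f)   # write the object to disk
rm(big.raw)               # remove it from the workspace
invisible(gc())           # run garbage collection so the memory can be reused

# ... do the computation that does not need big.raw ...

load(f)                   # later, restore it under its original name
```

load() puts the object back under the name it was saved with, so nothing else in your code needs to change.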

> 
> The computations that are particularly slow involve looping 
> with by().  The data are measurements of vertical profiles of 
> pressure, temperature, and salinity at a number of stations, 
> which are organized into a dataframe p.1 (1925930 rows, 8 
> columns: id, p, t, s, etc.; 1409 unique values of id), and 
> the objective is to get a much smaller dataframe with the 
> minimum and maximum pressure for each profile.  The slow part is:
> 
> h.maxmin <- by(p.1,p.1$id,function(x){
>              data.frame(id=x$id[1],
>                       maxp=max(x$p),
>                       minp=min(x$p))})
> 
> 2.  Even with unneeded data objects removed, this is very 
> slow.  Is there a faster way to get the maximum and minimum values?

Why do you need to use by(), and why have the function return a data frame
containing only one row?  Here's an experiment on my 900MHz PIII laptop:

> n <- 1e5
> dat <- data.frame(id = sort(sample(LETTERS, n, replace=TRUE)),
+                   p = rnorm(n))
> 
> 
> system.time(h.maxmin <- by(dat, dat$id,function(x) {
+   data.frame(id=x$id[1], maxp=max(x$p), minp=min(x$p))}))
[1] 2.75 0.01 2.78   NA   NA
> system.time(junk <- tapply(dat$p, dat$id, function(x) range(x)))
[1] 0.12 0.01 0.13   NA   NA

If you want to coerce the result to a data frame with id as row names and
min and max as the two variables, you can do:

  junk.dat <- as.data.frame(do.call("rbind", junk))
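
Since range() returns the minimum first and the maximum second, the two columns come out unnamed in that order.  Here is a self-contained sketch, using the same simulated data as above, that names them minp and maxp and copies the row names into an id column.  The set.seed() call and the new column names are my additions, not anything required.

```r
set.seed(1)   # only so the simulated data are reproducible
n <- 1e5
dat <- data.frame(id = sort(sample(LETTERS, n, replace = TRUE)),
                  p = rnorm(n))

# One (min, max) pair per id, held in a list-like array
junk <- tapply(dat$p, dat$id, range)

# Stack the pairs into a two-column data frame; the ids become row names
junk.dat <- as.data.frame(do.call("rbind", junk))
names(junk.dat) <- c("minp", "maxp")   # range() gives min first, then max
junk.dat$id <- rownames(junk.dat)      # keep id as an ordinary column too
```

For the real data, the same calls would apply with p.1$p and p.1$id in place of dat$p and dat$id.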

HTH,
Andy

> platform sparc-sun-solaris2.9
> arch     sparc               
> os       solaris2.9          
> system   sparc, solaris2.9   
> status                       
> major    1                   
> minor    7.0                 
> year     2003                
> month    04                  
> day      16                  
> language R             
> 
> Thank you for your time.
> 
> Helen
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list 
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 
