[R] Replacing for loop with tapply!?

Sat Jun 11 02:36:43 CEST 2005

OK, so you want to find some summary statistics for each column, where
some columns could be completely missing. 

Writing a small wrapper should help. When you use apply(), you are
actually applying a function to every column (or row). First, let us
simulate a dataset with 15 days/rows and 10 stations/columns 

### simulate data
set.seed(1)            # for reproducibility 
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)  
mat[ mat > 45 ] <- NA  # create some missing values
mat[ ,9 ]       <- NA  # station 9's data is completely missing

Here are two example of such wrappers :

find.stats1 <- function( data, threshold=c(37,39,41) ){

  n   <- length(threshold)
  out <- matrix(  nrow=(n + 1), ncol=ncol(data) ) # initialise

  out[1, ] <- apply(data, 2, function(x) 
                         ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))

  for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )

  rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
  colnames(out) <- rownames(data)  # name of the stations
  return( out )
}

find.stats2 <- function( data, threshold=c(37,39,41) ){

  n      <- length(threshold)
  excess <- numeric( n )
  out    <- matrix(  nrow=(n + 1), ncol=ncol(data) ) # initialise
  good   <- which( apply( data, 2, function(x) !all(is.na(x)) ) )
  # colums that are not completely missing

  out[ , good] <- apply( data[ , good], 2, function(x){
    m <- max( x, na.rm=T )
    for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }
    return( c(m, excess) )
  } ) 

  rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
  colnames(out) <- rownames(data)  # name of the stations
  return( out )
}

find.stats1( mat )
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max   44   42   39   41   45   43   42   45   NA    42
above_37     2    1    2    1    3    2    2    1    0     1
above_39     2    1    0    1    3    2    1    1    0     1
above_41     2    1    0    0    2    2    1    1    0     1

find.stats2( mat )
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max   44   42   39   41   45   43   42   45   NA    42
above_37     2    1    2    1    3    2    2    1   NA     1
above_39     2    1    0    1    3    2    1    1   NA     1
above_41     2    1    0    0    2    2    1    1   NA     1

On my laptop 'find.stats1' and 'find.stats2' (which is more flexible)
takes 7 and 6 seconds respectively to execute on a dataset with 10000
stations and 365 days.

Regards, Adai

On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote:
> Dear all,
> 
> Dimitris and Andy, thanks for your great help. I have progressed to the 
> following code which runs very fast and effective:
> 
> mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
> mat[mat>45] <- NA
> mat<-NA
> mat
> temps <- c(35, 37, 39)
> ind <- rbind(
>      t(sapply(temps, function(temp)
>        rowSums(mat > temp, na.rm=TRUE) )),
>      rowSums(!is.na(mat), na.rm=FALSE),
>      apply(mat, 1, max, na.rm=TRUE))
> ind <- t(ind)
> ind
> 
> However, some weather stations have missing values for the whole year. 
> Unfortunately, the code breaks down (when uncommenting mat<-NA).
> 
> I have tried 'ifelse' statements in the functions, but it becomes even 
> more of a mess. I could subset the matrix before hand, but this would 
> mean merging with a complete matrix afterwards to make it compatible 
> with other years. That would slow things down.
> 
> How can I make the code robust for rows containing all missing values?
> 
> Thanks for your help,
> 
> Sander.
> 
> Dimitris Rizopoulos wrote:
> > for the maximum you could use something like:
> > 
> > ind[, 1] <- apply(mat, 2, max)
> > 
> > I hope it helps.
> > 
> > Best,
> > Dimitris
> > 
> > ----
> > Dimitris Rizopoulos
> > Ph.D. Student
> > Biostatistical Centre
> > School of Public Health
> > Catholic University of Leuven
> > 
> > Address: Kapucijnenvoer 35, Leuven, Belgium
> > Tel: +32/16/336899
> > Fax: +32/16/337015
> > Web: http://www.med.kuleuven.ac.be/biostat/
> >      http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> > 
> > 
> > 
> > ----- Original Message ----- 
> > From: "Sander Oom" <slist at oomvanlieshout.net>
> > To: "Dimitris Rizopoulos" <dimitris.rizopoulos at med.kuleuven.be>
> > Cc: <r-help at stat.math.ethz.ch>
> > Sent: Friday, June 10, 2005 12:10 PM
> > Subject: Re: [R] Replacing for loop with tapply!?
> > 
> > 
> >>Thanks Dimitris,
> >>
> >>Very impressive! Much faster than before.
> >>
> >>Thanks to new found R.basic, I can simply rotate the result with
> >>rotate270{R.basic}:
> >>
> >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
> >>>temps <- c(37, 39, 41)
> >>>#################
> >>>#ind <- matrix(0, length(temps), ncol(mat))
> >>>ind <- matrix(0, 4, ncol(mat))
> >>>(startDate <- date())
> >>[1] "Fri Jun 10 12:08:01 2005"
> >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
> >>>ind[4, ] <- colMeans(max(mat))
> >>Error in colMeans(max(mat)) : 'x' must be an array of at least two
> >>dimensions
> >>>(endDate <- date())
> >>[1] "Fri Jun 10 12:08:02 2005"
> >>>ind <- rotate270(ind)
> >>>ind[1:10,]
> >>   V4 V3 V2 V1
> >>1   0 56 75 80
> >>2   0 46 53 60
> >>3   0 50 58 67
> >>4   0 60 72 80
> >>5   0 59 68 76
> >>6   0 55 67 74
> >>7   0 62 77 93
> >>8   0 45 57 67
> >>9   0 57 68 75
> >>10  0 61 66 76
> >>
> >>However, I have not managed to get the row maximum using your 
> >>method? It
> >>should be 50 for most rows, but my first guess code gives an error!
> >>
> >>Any suggestions?
> >>
> >>Sander
> >>
> >>
> >>
> >>Dimitris Rizopoulos wrote:
> >>>maybe you are looking for something along these lines:
> >>>
> >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
> >>>temps <- c(37, 39, 41)
> >>>#################
> >>>ind <- matrix(0, length(temps), ncol(mat))
> >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
> >>>ind
> >>>
> >>>
> >>>I hope it helps.
> >>>
> >>>Best,
> >>>Dimitris
> >>>
> >>>----
> >>>Dimitris Rizopoulos
> >>>Ph.D. Student
> >>>Biostatistical Centre
> >>>School of Public Health
> >>>Catholic University of Leuven
> >>>
> >>>Address: Kapucijnenvoer 35, Leuven, Belgium
> >>>Tel: +32/16/336899
> >>>Fax: +32/16/337015
> >>>Web: http://www.med.kuleuven.ac.be/biostat/
> >>>     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> >>>
> >>>
> >>>----- Original Message ----- 
> >>>From: "Sander Oom" <slist at oomvanlieshout.net>
> >>>To: <r-help at stat.math.ethz.ch>
> >>>Sent: Friday, June 10, 2005 10:50 AM
> >>>Subject: [R] Replacing for loop with tapply!?
> >>>
> >>>
> >>>>Dear all,
> >>>>
> >>>>We have a large data set with temperature data for weather stations
> >>>>across the globe (15000 stations).
> >>>>
> >>>>For each station, we need to calculate the number of days a certain
> >>>>temperature is exceeded.
> >>>>
> >>>>So far we used the following S code, where mat88 is a matrix
> >>>>containing
> >>>>rows of 365 daily temperatures for each of 15000 weather stations:
> >>>>
> >>>>m <- 37
> >>>>n <- 2
> >>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
> >>>>for(i in 1:nrow(mat88)) {
> >>>># i <- 3
> >>>>row1 <- as.data.frame(df88[i,  ])
> >>>>temprow37 <- select.rows(row1, row1 > m)
> >>>>temprow39 <- select.rows(row1, row1 > m + n)
> >>>>temprow41 <- select.rows(row1, row1 > m + 2 * n)
> >>>>outmat88[i, 1] <- max(row1, na.rm = T)
> >>>>outmat88[i, 2] <- count.rows(temprow37)
> >>>>outmat88[i, 3] <- count.rows(temprow39)
> >>>>outmat88[i, 4] <- count.rows(temprow41)
> >>>>}
> >>>>outmat88
> >>>>
> >>>>We have transferred the data to a more potent Linux box running R,
> >>>>but
> >>>>still hope to speed up the code.
> >>>>
> >>>>I know a for loop should be avoided when looking for speed. I also
> >>>>know
> >>>>the answer is in something like tapply, but my understanding of
> >>>>these
> >>>>commands is still to limited to see the solution. Could someone 
> >>>>show
> >>>>me
> >>>>the way!?
> >>>>
> >>>>Thanks in advance,
> >>>>
> >>>>Sander.
> >>>>--
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>