[R] Replacing for loop with tapply!?

Sander Oom slist at oomvanlieshout.net
Fri Jun 10 20:05:57 CEST 2005


Dear all,

Dimitris and Andy, thanks for your great help. I have progressed to the 
following code which runs very fast and effective:

mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[mat>45] <- NA
mat<-NA
mat
temps <- c(35, 37, 39)
ind <- rbind(
     t(sapply(temps, function(temp)
       rowSums(mat > temp, na.rm=TRUE) )),
     rowSums(!is.na(mat), na.rm=FALSE),
     apply(mat, 1, max, na.rm=TRUE))
ind <- t(ind)
ind

However, some weather stations have missing values for the whole year. 
Unfortunately, the code breaks down (when uncommenting mat<-NA).

I have tried 'ifelse' statements in the functions, but it becomes even 
more of a mess. I could subset the matrix before hand, but this would 
mean merging with a complete matrix afterwards to make it compatible 
with other years. That would slow things down.

How can I make the code robust for rows containing all missing values?

Thanks for your help,

Sander.

Dimitris Rizopoulos wrote:
> for the maximum you could use something like:
> 
> ind[, 1] <- apply(mat, 2, max)
> 
> I hope it helps.
> 
> Best,
> Dimitris
> 
> ----
> Dimitris Rizopoulos
> Ph.D. Student
> Biostatistical Centre
> School of Public Health
> Catholic University of Leuven
> 
> Address: Kapucijnenvoer 35, Leuven, Belgium
> Tel: +32/16/336899
> Fax: +32/16/337015
> Web: http://www.med.kuleuven.ac.be/biostat/
>      http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> 
> 
> 
> ----- Original Message ----- 
> From: "Sander Oom" <slist at oomvanlieshout.net>
> To: "Dimitris Rizopoulos" <dimitris.rizopoulos at med.kuleuven.be>
> Cc: <r-help at stat.math.ethz.ch>
> Sent: Friday, June 10, 2005 12:10 PM
> Subject: Re: [R] Replacing for loop with tapply!?
> 
> 
>>Thanks Dimitris,
>>
>>Very impressive! Much faster than before.
>>
>>Thanks to new found R.basic, I can simply rotate the result with
>>rotate270{R.basic}:
>>
>>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
>>>temps <- c(37, 39, 41)
>>>#################
>>>#ind <- matrix(0, length(temps), ncol(mat))
>>>ind <- matrix(0, 4, ncol(mat))
>>>(startDate <- date())
>>[1] "Fri Jun 10 12:08:01 2005"
>>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
>>>ind[4, ] <- colMeans(max(mat))
>>Error in colMeans(max(mat)) : 'x' must be an array of at least two
>>dimensions
>>>(endDate <- date())
>>[1] "Fri Jun 10 12:08:02 2005"
>>>ind <- rotate270(ind)
>>>ind[1:10,]
>>   V4 V3 V2 V1
>>1   0 56 75 80
>>2   0 46 53 60
>>3   0 50 58 67
>>4   0 60 72 80
>>5   0 59 68 76
>>6   0 55 67 74
>>7   0 62 77 93
>>8   0 45 57 67
>>9   0 57 68 75
>>10  0 61 66 76
>>
>>However, I have not managed to get the row maximum using your 
>>method? It
>>should be 50 for most rows, but my first guess code gives an error!
>>
>>Any suggestions?
>>
>>Sander
>>
>>
>>
>>Dimitris Rizopoulos wrote:
>>>maybe you are looking for something along these lines:
>>>
>>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
>>>temps <- c(37, 39, 41)
>>>#################
>>>ind <- matrix(0, length(temps), ncol(mat))
>>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
>>>ind
>>>
>>>
>>>I hope it helps.
>>>
>>>Best,
>>>Dimitris
>>>
>>>----
>>>Dimitris Rizopoulos
>>>Ph.D. Student
>>>Biostatistical Centre
>>>School of Public Health
>>>Catholic University of Leuven
>>>
>>>Address: Kapucijnenvoer 35, Leuven, Belgium
>>>Tel: +32/16/336899
>>>Fax: +32/16/337015
>>>Web: http://www.med.kuleuven.ac.be/biostat/
>>>     http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
>>>
>>>
>>>----- Original Message ----- 
>>>From: "Sander Oom" <slist at oomvanlieshout.net>
>>>To: <r-help at stat.math.ethz.ch>
>>>Sent: Friday, June 10, 2005 10:50 AM
>>>Subject: [R] Replacing for loop with tapply!?
>>>
>>>
>>>>Dear all,
>>>>
>>>>We have a large data set with temperature data for weather stations
>>>>across the globe (15000 stations).
>>>>
>>>>For each station, we need to calculate the number of days a certain
>>>>temperature is exceeded.
>>>>
>>>>So far we used the following S code, where mat88 is a matrix
>>>>containing
>>>>rows of 365 daily temperatures for each of 15000 weather stations:
>>>>
>>>>m <- 37
>>>>n <- 2
>>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
>>>>for(i in 1:nrow(mat88)) {
>>>># i <- 3
>>>>row1 <- as.data.frame(df88[i,  ])
>>>>temprow37 <- select.rows(row1, row1 > m)
>>>>temprow39 <- select.rows(row1, row1 > m + n)
>>>>temprow41 <- select.rows(row1, row1 > m + 2 * n)
>>>>outmat88[i, 1] <- max(row1, na.rm = T)
>>>>outmat88[i, 2] <- count.rows(temprow37)
>>>>outmat88[i, 3] <- count.rows(temprow39)
>>>>outmat88[i, 4] <- count.rows(temprow41)
>>>>}
>>>>outmat88
>>>>
>>>>We have transferred the data to a more potent Linux box running R,
>>>>but
>>>>still hope to speed up the code.
>>>>
>>>>I know a for loop should be avoided when looking for speed. I also
>>>>know
>>>>the answer is in something like tapply, but my understanding of
>>>>these
>>>>commands is still to limited to see the solution. Could someone 
>>>>show
>>>>me
>>>>the way!?
>>>>
>>>>Thanks in advance,
>>>>
>>>>Sander.
>>>>--




More information about the R-help mailing list