[R] Replacing for loop with tapply!?
Sander Oom
slist at oomvanlieshout.net
Fri Jun 10 20:05:57 CEST 2005
Dear all,
Dimitris and Andy, thanks for your great help. I have progressed to the
following code, which runs very fast and efficiently:
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[mat>45] <- NA
# mat<-NA
mat
temps <- c(35, 37, 39)
ind <- rbind(
   t(sapply(temps, function(temp)
     rowSums(mat > temp, na.rm=TRUE) )),
   rowSums(!is.na(mat), na.rm=FALSE),
   apply(mat, 1, max, na.rm=TRUE))
ind <- t(ind)
ind
However, some weather stations have missing values for the whole year.
Unfortunately, the code breaks down (when uncommenting mat<-NA).
I have tried 'ifelse' statements in the functions, but it becomes even
more of a mess. I could subset the matrix beforehand, but this would
mean merging with a complete matrix afterwards to make it compatible
with other years. That would slow things down.
How can I make the code robust for rows containing all missing values?
Thanks for your help,
Sander.
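
One possible way to make the row maximum robust to all-NA rows, as a
minimal sketch rather than a definitive answer: compute the maximum only
for rows that have at least one observation and leave NA elsewhere. The
names n.obs, has.data and row.max, the seed, and the choice of row 3 as
the empty station are assumptions made for illustration; they do not
appear in the thread.

```r
# Sketch only: one row per station, one column per day.
set.seed(1)                       # illustrative seed, not from the post
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[mat > 45] <- NA
mat[3, ] <- NA                    # hypothetical station with no data at all

temps <- c(35, 37, 39)

# Number of non-missing days per station, and which stations have any data
n.obs <- rowSums(!is.na(mat))
has.data <- n.obs > 0

# Days above each threshold; with na.rm = TRUE an all-NA row simply gives 0
counts <- sapply(temps, function(temp) rowSums(mat > temp, na.rm = TRUE))

# Row maxima: computed only where data exist, NA for empty stations,
# so max() is never called on a row with nothing left after NA removal
row.max <- rep(NA_real_, nrow(mat))
row.max[has.data] <- apply(mat[has.data, , drop = FALSE], 1, max,
                           na.rm = TRUE)

ind <- cbind(counts, n.obs, row.max)
colnames(ind) <- c(paste0("gt", temps), "n.obs", "max")
ind
```

The result keeps one row per station, so it should stay compatible with
other years without subsetting and merging. If a count of 0 is
misleading for an empty station, counts[!has.data, ] <- NA would mark
those rows as missing instead.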
Dimitris Rizopoulos wrote:
> for the maximum you could use something like:
>
> ind[, 1] <- apply(mat, 2, max)
>
> I hope it helps.
>
> Best,
> Dimitris
>
> ----
> Dimitris Rizopoulos
> Ph.D. Student
> Biostatistical Centre
> School of Public Health
> Catholic University of Leuven
>
> Address: Kapucijnenvoer 35, Leuven, Belgium
> Tel: +32/16/336899
> Fax: +32/16/337015
> Web: http://www.med.kuleuven.ac.be/biostat/
> http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
>
>
>
> ----- Original Message -----
> From: "Sander Oom" <slist at oomvanlieshout.net>
> To: "Dimitris Rizopoulos" <dimitris.rizopoulos at med.kuleuven.be>
> Cc: <r-help at stat.math.ethz.ch>
> Sent: Friday, June 10, 2005 12:10 PM
> Subject: Re: [R] Replacing for loop with tapply!?
>
>
>>Thanks Dimitris,
>>
>>Very impressive! Much faster than before.
>>
>>Thanks to the newly found R.basic package, I can simply rotate the
>>result with rotate270{R.basic}:
>>
>> > mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
>> > temps <- c(37, 39, 41)
>> > #################
>> > #ind <- matrix(0, length(temps), ncol(mat))
>> > ind <- matrix(0, 4, ncol(mat))
>> > (startDate <- date())
>> [1] "Fri Jun 10 12:08:01 2005"
>> > for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
>> > ind[4, ] <- colMeans(max(mat))
>> Error in colMeans(max(mat)) : 'x' must be an array of at least two
>> dimensions
>> > (endDate <- date())
>> [1] "Fri Jun 10 12:08:02 2005"
>> > ind <- rotate270(ind)
>> > ind[1:10,]
>>    V4 V3 V2 V1
>> 1   0 56 75 80
>> 2   0 46 53 60
>> 3   0 50 58 67
>> 4   0 60 72 80
>> 5   0 59 68 76
>> 6   0 55 67 74
>> 7   0 62 77 93
>> 8   0 45 57 67
>> 9   0 57 68 75
>> 10  0 61 66 76
>>
>>However, I have not managed to get the row maximum using your
>>method. It should be 50 for most rows, but my first-guess code gives
>>an error!
>>
>>Any suggestions?
>>
>>Sander
>>
>>
>>
>>Dimitris Rizopoulos wrote:
>>>maybe you are looking for something along these lines:
>>>
>>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
>>>temps <- c(37, 39, 41)
>>>#################
>>>ind <- matrix(0, length(temps), ncol(mat))
>>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
>>>ind
>>>
>>>
>>>I hope it helps.
>>>
>>>Best,
>>>Dimitris
>>>
>>>
>>>
>>>----- Original Message -----
>>>From: "Sander Oom" <slist at oomvanlieshout.net>
>>>To: <r-help at stat.math.ethz.ch>
>>>Sent: Friday, June 10, 2005 10:50 AM
>>>Subject: [R] Replacing for loop with tapply!?
>>>
>>>
>>>>Dear all,
>>>>
>>>>We have a large data set with temperature data for weather stations
>>>>across the globe (15000 stations).
>>>>
>>>>For each station, we need to calculate the number of days a certain
>>>>temperature is exceeded.
>>>>
>>>>So far we used the following S code, where mat88 is a matrix
>>>>containing rows of 365 daily temperatures for each of 15000 weather
>>>>stations:
>>>>
>>>>m <- 37
>>>>n <- 2
>>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
>>>>for(i in 1:nrow(mat88)) {
>>>># i <- 3
>>>>row1 <- as.data.frame(df88[i, ])
>>>>temprow37 <- select.rows(row1, row1 > m)
>>>>temprow39 <- select.rows(row1, row1 > m + n)
>>>>temprow41 <- select.rows(row1, row1 > m + 2 * n)
>>>>outmat88[i, 1] <- max(row1, na.rm = T)
>>>>outmat88[i, 2] <- count.rows(temprow37)
>>>>outmat88[i, 3] <- count.rows(temprow39)
>>>>outmat88[i, 4] <- count.rows(temprow41)
>>>>}
>>>>outmat88
>>>>
>>>>We have transferred the data to a more potent Linux box running R,
>>>>but still hope to speed up the code.
>>>>
>>>>I know a for loop should be avoided when looking for speed. I also
>>>>know the answer is in something like tapply, but my understanding of
>>>>these commands is still too limited to see the solution. Could
>>>>someone show me the way!?
>>>>
>>>>Thanks in advance,
>>>>
>>>>Sander.
>>>>--