[R] Replacing for loop with tapply!?
Adaikalavan Ramasamy
ramasamy at cancer.org.uk
Sat Jun 11 02:36:43 CEST 2005
OK, so you want to find some summary statistics for each column, where
some columns could be completely missing.
Writing a small wrapper should help. When you use apply(), you are
actually applying a function to every column (or row). First, let us
simulate a dataset with 15 days/rows and 10 stations/columns
### simulate data
set.seed(1) # for reproducibility
mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
mat[ mat > 45 ] <- NA # create some missing values
mat[ ,9 ] <- NA # station 9's data is completely missing
Here are two example of such wrappers :
find.stats1 <- function( data, threshold=c(37,39,41) ){
n <- length(threshold)
out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise
out[1, ] <- apply(data, 2, function(x)
ifelse( all(is.na(x)), NA, max(x, na.rm=T) ))
for(i in 1:n) out[ i+1, ] <- colSums( data > threshold[i], na.rm=T )
rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
colnames(out) <- rownames(data) # name of the stations
return( out )
}
find.stats2 <- function( data, threshold=c(37,39,41) ){
n <- length(threshold)
excess <- numeric( n )
out <- matrix( nrow=(n + 1), ncol=ncol(data) ) # initialise
good <- which( apply( data, 2, function(x) !all(is.na(x)) ) )
# colums that are not completely missing
out[ , good] <- apply( data[ , good], 2, function(x){
m <- max( x, na.rm=T )
for(i in 1:n){ excess[i] <- sum( x > threshold[i], na.rm=TRUE ) }
return( c(m, excess) )
} )
rownames(out) <- c( "daily_max", paste("above", threshold, sep="_") )
colnames(out) <- rownames(data) # name of the stations
return( out )
}
find.stats1( mat )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max 44 42 39 41 45 43 42 45 NA 42
above_37 2 1 2 1 3 2 2 1 0 1
above_39 2 1 0 1 3 2 1 1 0 1
above_41 2 1 0 0 2 2 1 1 0 1
find.stats2( mat )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
daily_max 44 42 39 41 45 43 42 45 NA 42
above_37 2 1 2 1 3 2 2 1 NA 1
above_39 2 1 0 1 3 2 1 1 NA 1
above_41 2 1 0 0 2 2 1 1 NA 1
On my laptop 'find.stats1' and 'find.stats2' (which is more flexible)
takes 7 and 6 seconds respectively to execute on a dataset with 10000
stations and 365 days.
Regards, Adai
On Fri, 2005-06-10 at 20:05 +0200, Sander Oom wrote:
> Dear all,
>
> Dimitris and Andy, thanks for your great help. I have progressed to the
> following code which runs very fast and effective:
>
> mat <- matrix(sample(-15:50, 15 * 10, TRUE), 15, 10)
> mat[mat>45] <- NA
> mat<-NA
> mat
> temps <- c(35, 37, 39)
> ind <- rbind(
> t(sapply(temps, function(temp)
> rowSums(mat > temp, na.rm=TRUE) )),
> rowSums(!is.na(mat), na.rm=FALSE),
> apply(mat, 1, max, na.rm=TRUE))
> ind <- t(ind)
> ind
>
> However, some weather stations have missing values for the whole year.
> Unfortunately, the code breaks down (when uncommenting mat<-NA).
>
> I have tried 'ifelse' statements in the functions, but it becomes even
> more of a mess. I could subset the matrix before hand, but this would
> mean merging with a complete matrix afterwards to make it compatible
> with other years. That would slow things down.
>
> How can I make the code robust for rows containing all missing values?
>
> Thanks for your help,
>
> Sander.
>
> Dimitris Rizopoulos wrote:
> > for the maximum you could use something like:
> >
> > ind[, 1] <- apply(mat, 2, max)
> >
> > I hope it helps.
> >
> > Best,
> > Dimitris
> >
> > ----
> > Dimitris Rizopoulos
> > Ph.D. Student
> > Biostatistical Centre
> > School of Public Health
> > Catholic University of Leuven
> >
> > Address: Kapucijnenvoer 35, Leuven, Belgium
> > Tel: +32/16/336899
> > Fax: +32/16/337015
> > Web: http://www.med.kuleuven.ac.be/biostat/
> > http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> >
> >
> >
> > ----- Original Message -----
> > From: "Sander Oom" <slist at oomvanlieshout.net>
> > To: "Dimitris Rizopoulos" <dimitris.rizopoulos at med.kuleuven.be>
> > Cc: <r-help at stat.math.ethz.ch>
> > Sent: Friday, June 10, 2005 12:10 PM
> > Subject: Re: [R] Replacing for loop with tapply!?
> >
> >
> >>Thanks Dimitris,
> >>
> >>Very impressive! Much faster than before.
> >>
> >>Thanks to new found R.basic, I can simply rotate the result with
> >>rotate270{R.basic}:
> >>
> >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
> >>>temps <- c(37, 39, 41)
> >>>#################
> >>>#ind <- matrix(0, length(temps), ncol(mat))
> >>>ind <- matrix(0, 4, ncol(mat))
> >>>(startDate <- date())
> >>[1] "Fri Jun 10 12:08:01 2005"
> >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
> >>>ind[4, ] <- colMeans(max(mat))
> >>Error in colMeans(max(mat)) : 'x' must be an array of at least two
> >>dimensions
> >>>(endDate <- date())
> >>[1] "Fri Jun 10 12:08:02 2005"
> >>>ind <- rotate270(ind)
> >>>ind[1:10,]
> >> V4 V3 V2 V1
> >>1 0 56 75 80
> >>2 0 46 53 60
> >>3 0 50 58 67
> >>4 0 60 72 80
> >>5 0 59 68 76
> >>6 0 55 67 74
> >>7 0 62 77 93
> >>8 0 45 57 67
> >>9 0 57 68 75
> >>10 0 61 66 76
> >>
> >>However, I have not managed to get the row maximum using your
> >>method? It
> >>should be 50 for most rows, but my first guess code gives an error!
> >>
> >>Any suggestions?
> >>
> >>Sander
> >>
> >>
> >>
> >>Dimitris Rizopoulos wrote:
> >>>maybe you are looking for something along these lines:
> >>>
> >>>mat <- matrix(sample(-15:50, 365 * 15000, TRUE), 365, 15000)
> >>>temps <- c(37, 39, 41)
> >>>#################
> >>>ind <- matrix(0, length(temps), ncol(mat))
> >>>for(i in seq(along = temps)) ind[i, ] <- colSums(mat > temps[i])
> >>>ind
> >>>
> >>>
> >>>I hope it helps.
> >>>
> >>>Best,
> >>>Dimitris
> >>>
> >>>----
> >>>Dimitris Rizopoulos
> >>>Ph.D. Student
> >>>Biostatistical Centre
> >>>School of Public Health
> >>>Catholic University of Leuven
> >>>
> >>>Address: Kapucijnenvoer 35, Leuven, Belgium
> >>>Tel: +32/16/336899
> >>>Fax: +32/16/337015
> >>>Web: http://www.med.kuleuven.ac.be/biostat/
> >>> http://www.student.kuleuven.ac.be/~m0390867/dimitris.htm
> >>>
> >>>
> >>>----- Original Message -----
> >>>From: "Sander Oom" <slist at oomvanlieshout.net>
> >>>To: <r-help at stat.math.ethz.ch>
> >>>Sent: Friday, June 10, 2005 10:50 AM
> >>>Subject: [R] Replacing for loop with tapply!?
> >>>
> >>>
> >>>>Dear all,
> >>>>
> >>>>We have a large data set with temperature data for weather stations
> >>>>across the globe (15000 stations).
> >>>>
> >>>>For each station, we need to calculate the number of days a certain
> >>>>temperature is exceeded.
> >>>>
> >>>>So far we used the following S code, where mat88 is a matrix
> >>>>containing
> >>>>rows of 365 daily temperatures for each of 15000 weather stations:
> >>>>
> >>>>m <- 37
> >>>>n <- 2
> >>>>outmat88 <- matrix(0, ncol = 4, nrow = nrow(mat88))
> >>>>for(i in 1:nrow(mat88)) {
> >>>># i <- 3
> >>>>row1 <- as.data.frame(df88[i, ])
> >>>>temprow37 <- select.rows(row1, row1 > m)
> >>>>temprow39 <- select.rows(row1, row1 > m + n)
> >>>>temprow41 <- select.rows(row1, row1 > m + 2 * n)
> >>>>outmat88[i, 1] <- max(row1, na.rm = T)
> >>>>outmat88[i, 2] <- count.rows(temprow37)
> >>>>outmat88[i, 3] <- count.rows(temprow39)
> >>>>outmat88[i, 4] <- count.rows(temprow41)
> >>>>}
> >>>>outmat88
> >>>>
> >>>>We have transferred the data to a more potent Linux box running R,
> >>>>but
> >>>>still hope to speed up the code.
> >>>>
> >>>>I know a for loop should be avoided when looking for speed. I also
> >>>>know
> >>>>the answer is in something like tapply, but my understanding of
> >>>>these
> >>>>commands is still to limited to see the solution. Could someone
> >>>>show
> >>>>me
> >>>>the way!?
> >>>>
> >>>>Thanks in advance,
> >>>>
> >>>>Sander.
> >>>>--
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
More information about the R-help
mailing list