[R] Efficiency Question - Nested lapply or nested for loop

Fri Oct 8 18:28:16 CEST 2010

You are losing a lot of time by repeatedly building character
indices with paste() in every iteration. Two options:

-- 1) calculate these once outside the loop and then refer to them by  
index

idx.names <- vector(mode="list", length=nind)   # one character vector per individual
for (i in (0:(nind-1))) {idx.names[[i+1]] <-    # need the offset
        c(paste("G_hat_0_",i,sep=""),
          paste("G_hat_1_",i,sep=""),
          paste("G_hat_2_",i,sep=""),
          paste("G_",i,sep="") ) }

Then the inner loop would be:
for (i in (0:(nind-1))) {
       nms = idx.names[[i+1]]     # the four column names for individual i
       Gmax = which.max(c(data[[ nms[1] ]][row],
                          data[[ nms[2] ]][row],
                          data[[ nms[3] ]][row] ))

       Gtru = data[[ nms[4] ]][row] + 1    # add 1 to match Gmax range

       cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
                        }

And as has been said many times before,...
require(fortunes)
fortune("dog")

-- 2) probably even faster to pre-calculate (or just construct by
inspection) those column indices as a numeric vector and then
access with data[row, numidxs[i] ]
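
A minimal sketch of option 2 on the example data from the quoted message. The names m, ids and the 4 x nind layout of numidxs are illustrative choices of mine, and converting to a numeric matrix first is an extra assumption (it requires every column except name to be numeric):

```r
# Example data copied from the quoted message
data <- data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
    G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
    G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
    G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
    G_0=c(1,1,1,2,2,2),
    G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
    G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
    G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
    G_1=c(1,2,0,0,1,1))
nind <- length(grep("^G_[0-9]+$", names(data)))

# Numeric matrix of everything except the name column
m <- data.matrix(data[, -1])

# numidxs: one column per individual; rows 1-3 hold the positions of the
# G_hat_* columns in m, row 4 the position of the true-genotype column
ids <- 0:(nind-1)
numidxs <- rbind(match(paste("G_hat_0_", ids, sep=""), colnames(m)),
                 match(paste("G_hat_1_", ids, sep=""), colnames(m)),
                 match(paste("G_hat_2_", ids, sep=""), colnames(m)),
                 match(paste("G_", ids, sep=""), colnames(m)))

cmat <- matrix(0, nrow=3, ncol=3,
               dimnames=list(c("call_rr","call_rv","call_vv"),
                             c("tru_rr","tru_rv","tru_vv")))

for (row in seq_len(nrow(m))) {
    for (i in seq_len(nind)) {
        Gmax <- which.max(m[row, numidxs[1:3, i]])
        Gtru <- m[row, numidxs[4, i]] + 1     # add 1 to match Gmax range
        cmat[Gmax, Gtru] <- cmat[Gmax, Gtru] + 1
    }
}
```

The loops are unchanged in shape; the only difference is that no paste() call and no name lookup happens inside them.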

The for-loop is generally going to be faster than an lapply solution.  
The fastest solution would be a fully indexed strategy, which might  
become more apparent (it's not yet so to me) after you implement the  
second option above.
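
For what it is worth, here is one possible shape of such a fully indexed version. It is a sketch, not a benchmarked solution. It assumes there are no NA values, that the true genotypes are coded 0, 1, 2, and that ties should go to the first maximum as which.max() does. The names ids, p0, p1, p2, tru and call are mine:

```r
# Example data copied from the quoted message
data <- data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
    G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
    G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
    G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
    G_0=c(1,1,1,2,2,2),
    G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
    G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
    G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
    G_1=c(1,2,0,0,1,1))

# Individual labels taken from the true-genotype columns G_0, G_1, ...
ids <- sub("^G_", "", grep("^G_[0-9]+$", names(data), value=TRUE))

# One nrow x nind matrix per genotype class, plus the true genotypes
p0  <- as.matrix(data[paste("G_hat_0_", ids, sep="")])
p1  <- as.matrix(data[paste("G_hat_1_", ids, sep="")])
p2  <- as.matrix(data[paste("G_hat_2_", ids, sep="")])
tru <- as.matrix(data[paste("G_", ids, sep="")])

# Called genotype (1, 2 or 3) for every row/individual pair at once;
# >= keeps the first maximum on ties, as which.max() would
call <- ifelse(p0 >= p1 & p0 >= p2, 1L, ifelse(p1 >= p2, 2L, 3L))

# Cell (call, tru+1) of a 3 x 3 matrix has column-major position
# call + 3*tru, so a single tabulate() counts all nine cells
cmat <- matrix(tabulate(call + 3L * tru, nbins=9), nrow=3, ncol=3,
               dimnames=list(c("call_rr","call_rv","call_vv"),
                             c("tru_rr","tru_rv","tru_vv")))
```

On the six-row example this gives the same cmat as the table in the quoted message, with no explicit loop over rows or individuals.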

-- 
David.

On Oct 8, 2010, at 11:35 AM, epowell wrote:

>
> My data looks like this:
>
>> data
>   name G_hat_0_0 G_hat_1_0 G_hat_2_0 G_0 G_hat_0_1 G_hat_1_1 G_hat_2_1 G_1
> 1  rs0  0.488000  0.448625  0.063375   1  0.480875  0.454500  0.064625   1
> 2  rs1  0.002375  0.955375  0.042250   1  0.000000  0.062875  0.937125   2
> 3  rs2  0.050375  0.835875  0.113750   1  0.877250  0.115875  0.006875   0
> 4  rs3  0.000000  0.074750  0.925250   2  0.897750  0.102000  0.000250   0
> 5  rs4  0.000125  0.052375  0.947500   2  0.261500  0.724125  0.014375   1
> 6  rs5  0.003750  0.092125  0.904125   2  0.023000  0.738125  0.238875   1
>
> And my task is:
> For each individual (X) on each row, to find the index corresponding to the
> max of G_hat_0_X, G_hat_1_X, G_hat_2_X and then increment the cell of the
> confusion matrix with the row corresponding to that index and the column
> corresponding to G_X.
>
> For example, in the first row and the first individual, the index with the
> max value (0.488000) is 0 and the G_0 value is 1, so I would increment the
> matrix cell in the first row and second column. (Note that the ranges
> between rows and columns are one off.  That is accounted for in the code.)
>
> In reality the data will be much bigger, containing 10000 rows and a
> variable number of columns (inds) between 10 and 500.
>
> The correct result is:
>
>> cmat
>        tru_rr tru_rv tru_vv
> call_rr      2      2      0
> call_rv      0      4      0
> call_vv      0      0      4
>
> I am not sure what the best way to do this is.  I implemented it once using
> two for loops.  Then I tried to use lapply and came up with a nested lapply
> solution, but it was slower than the simple loops.  I still think that there
> is a better way and I was hoping for some advice.  Perhaps something with
> pmax....
>
> #### DATA PREP ##########
>
> data = data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
> 	G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
> 	G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
> 	G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
> 	G_0=c(1,1,1,2,2,2),
> 	G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
> 	G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
> 	G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
> 	G_1=c(1,2,0,0,1,1))	
>
> # get list of inds in file (e.g. G_0,G_1,...,G_100)
> inds = grep("G_[0-9]+",names(data),perl=T,value=T)
>
> # get total number of inds
> nind = length(inds)
>
> # create an empty "confusion" table
> cmat = matrix(rep(0,9), nrow=3, ncol=3)
> colnames(cmat) = c("tru_rr", "tru_rv", "tru_vv")
> rownames(cmat) = c("call_rr","call_rv","call_vv")
>
> ## APPROACH 1: Nested For Loop ####
>
> # Nested Loop Approach
> for (row in (1:nrow(data))) {
> for (i in (0:(nind-1))) {
>
>     Gmax = which.max(c( data[[paste("G_hat_0_",i,sep="")]][row],
>                         data[[paste("G_hat_1_",i,sep="")]][row],
>                         data[[paste("G_hat_2_",i,sep="")]][row] ))
>
>     Gtru = data[[paste("G_",i,sep="")]][row] + 1    # add 1 to match Gmax range
>
>     cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
> }
> }
>
>
> ## APPROACH 2: Nested lapply ####
>
> # This routine finds the geno w/ highest prob from the erg.avgs.
> # and compares it to the true geno. Result is tallied by 		
> # incrementing the appropriate index of the confusion matrix 	
>
> add2cmat <- function(ind,locus) {
>
>     Gmax = which.max(c( data[[paste("G_hat_0_",ind,sep="")]][locus],
>                         data[[paste("G_hat_1_",ind,sep="")]][locus],
>                         data[[paste("G_hat_2_",ind,sep="")]][locus] ))
>
>     Gtru = data[[paste("G_",ind,sep="")]][locus] + 1    # add 1 to match Gmax range
>
>     cmat[Gmax,Gtru] <<- cmat[Gmax,Gtru] + 1    # use double arrow to modify global env.
>
> }
>
> # Run add2cmat for all individuals on a given locus
>
> add_locus2cmat <- function(locus) {
> 	lapply(0:(nind-1),add2cmat,locus)
> }
>
> junk = lapply((1:nrow(data)),add_locus2cmat)  # don't need return value
>
>
>

David Winsemius, MD
West Hartford, CT