[R] Efficiency Question - Nested lapply or nested for loop

Fri Oct 8 17:35:47 CEST 2010

My data looks like this: 

> data
  name G_hat_0_0 G_hat_1_0 G_hat_2_0 G_0 G_hat_0_1 G_hat_1_1 G_hat_2_1 G_1
1  rs0  0.488000  0.448625  0.063375   1  0.480875  0.454500  0.064625   1
2  rs1  0.002375  0.955375  0.042250   1  0.000000  0.062875  0.937125   2
3  rs2  0.050375  0.835875  0.113750   1  0.877250  0.115875  0.006875   0
4  rs3  0.000000  0.074750  0.925250   2  0.897750  0.102000  0.000250   0
5  rs4  0.000125  0.052375  0.947500   2  0.261500  0.724125  0.014375   1
6  rs5  0.003750  0.092125  0.904125   2  0.023000  0.738125  0.238875   1

And my task is: 
For each individual (X) on each row, to find the index corresponding to the
max of G_hat_X_0, G_hat_X_1, G_hat_X_2 and then increment the cell of the
confusion matrix with the row corresponding to that index and the column
corresponding to G_X.

For example, in the first row and the first individual, the index with the
max value (0.488000) is 0 and the G_0 value is 1, so I would increment
matrix index of the first row and second column. (Note that the ranges
between rows and columns are one off.  That is accounted for in the code.)

In reality the data will be much bigger, containing 10000 rows and a
variable number of columns (inds) between 10 and 500. 

The correct result is:

> cmat
        tru_rr tru_rv tru_vv
call_rr      2      2      0
call_rv      0      4      0
call_vv      0      0      4

I am not sure what the best way to do this is.  I implemented it once using
two for loops.  Then I tried to use lapply and came up with a nested lapply
solution, but it was slower than the simple loops.  I still think that there
is a better way and I was hoping for some advice.  Perhaps something with
pmax....

#### DATA PREP ##########

data = data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
	G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
	G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
	G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
	G_0=c(1,1,1,2,2,2),
	G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
	G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
	G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
	G_1=c(1,2,0,0,1,1))	

# get list of inds in file (e.g. G_0,G_1,...,G_100)
inds = grep("G_[0-9]+",names(data),perl=T,value=T)

# get total number of inds
nind = length(inds)

# create an empty "confusion" table
cmat = matrix(rep(0,9), nrow=3, ncol=3)
colnames(cmat) = c("tru_rr", "tru_rv", "tru_vv")
rownames(cmat) = c("call_rr","call_rv","call_vv")

## APPROACH 1: Nested For Loop ####

# Nested Loop Approach
for (row in (1:nrow(data))) {
for (i in (0:(nind-1))) { 

	Gmax = which.max(c( data[[paste("G_hat_0_",i,sep="")]][row],
				  data[[paste("G_hat_1_",i,sep="")]][row],
				  data[[paste("G_hat_2_",i,sep="")]][row] ))

	Gtru = data[[paste("G_",i,sep="")]][row] + 1	# add 1 to match Gmax range

	cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
}
}

## APPROACH 2: Nested lapply ####

# This routine finds the geno w/ highest prob from the erg.avgs. 
# and compares it to the true geno. Result is tallied by 		
# incrementing the appropriate index of the confusion matrix 	

add2cmat <- function(ind,locus) { 

	Gmax = which.max(c( data[[paste("G_hat_0_",ind,sep="")]][locus],
				  data[[paste("G_hat_1_",ind,sep="")]][locus],
				  data[[paste("G_hat_2_",ind,sep="")]][locus] ))

	Gtru = data[[paste("G_",ind,sep="")]][locus] + 1	# add 1 to match Gmax
range

	cmat[Gmax,Gtru] <<- cmat[Gmax,Gtru] + 1			# use double arrow to modify
global env.

}

# Run add2cmat for all individuals on a given locus

add_locus2cmat <- function(locus) { 
	lapply(0:(nind-1),add2cmat,locus) 
}

junk = lapply((1:nrow(data)),add_locus2cmat)  # don't need return value

-- 
View this message in context: http://r.789695.n4.nabble.com/Efficiency-Question-Nested-lapply-or-nested-for-loop-tp2968553p2968553.html
Sent from the R help mailing list archive at Nabble.com.