[R] Efficiency Question - Nested lapply or nested for loop
David Winsemius
dwinsemius at comcast.net
Fri Oct 8 18:28:16 CEST 2010
You are loosing a lot of time by repeatedly calculating character
indices with paste() in every iteration. Two options:
-- 1) calculate these once outside the loop and then refer to them by
index
idx.names <- vector(mode="character", length=nind)
for (i in (0:(nind-1))) {idx[i+1] <- # need the offset
c(paste("G_hat_0_",i,sep=""),
paste("G_hat_1_",i,sep=""),
paste("G_hat_2_",i,sep=""),
paste("G_",i,sep="") ) }
Then the inner loop would be:
for (i in (0:(nind-1))) {
Gmax = which.max(c(data[[ idx.names[1] ]][row],
data[[ idx.names[2] ]][row],
data[[ idx.names[3] ]][row] ))
Gtru = data[[ idx.names[4] ]][row] + 1 # add 1 to match Gmax range
}
And as has been said many times before,...
require(fortunes)
fortune("dog")
-- 2) probably even faster to pre-calculate (or just construct by
inspection) those column indices as a numeric vector and use then
access with data[row, numidxs[i] ]
The for-loop is generally going to be faster than an lapply solution.
The fastest solution would be a fully indexed strategy, which might
become more apparent (it's not yet so to me) after you implement the
second option above.
--
David.
On Oct 8, 2010, at 11:35 AM, epowell wrote:
>
> My data looks like this:
>
>> data
> name G_hat_0_0 G_hat_1_0 G_hat_2_0 G_0 G_hat_0_1 G_hat_1_1
> G_hat_2_1 G_1
> 1 rs0 0.488000 0.448625 0.063375 1 0.480875 0.454500
> 0.064625 1
> 2 rs1 0.002375 0.955375 0.042250 1 0.000000 0.062875
> 0.937125 2
> 3 rs2 0.050375 0.835875 0.113750 1 0.877250 0.115875
> 0.006875 0
> 4 rs3 0.000000 0.074750 0.925250 2 0.897750 0.102000
> 0.000250 0
> 5 rs4 0.000125 0.052375 0.947500 2 0.261500 0.724125
> 0.014375 1
> 6 rs5 0.003750 0.092125 0.904125 2 0.023000 0.738125
> 0.238875 1
>
> And my task is:
> For each individual (X) on each row, to find the index corresponding
> to the
> max of G_hat_X_0, G_hat_X_1, G_hat_X_2 and then increment the cell
> of the
> confusion matrix with the row corresponding to that index and the
> column
> corresponding to G_X.
>
> For example, in the first row and the first individual, the index
> with the
> max value (0.488000) is 0 and the G_0 value is 1, so I would increment
> matrix index of the first row and second column. (Note that the ranges
> between rows and columns are one off. That is accounted for in the
> code.)
>
> In reality the data will be much bigger, containing 10000 rows and a
> variable number of columns (inds) between 10 and 500.
>
> The correct result is:
>
>> cmat
> tru_rr tru_rv tru_vv
> call_rr 2 2 0
> call_rv 0 4 0
> call_vv 0 0 4
>
> I am not sure what the best way to do this is. I implemented it
> once using
> two for loops. Then I tried to use lapply and came up with a nested
> lapply
> solution, but it was slower than the simple loops. I still think
> that there
> is a better way and I was hoping for some advice. Perhaps something
> with
> pmax....
>
> #### DATA PREP ##########
>
> data = data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
> G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
> G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
> G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
> G_0=c(1,1,1,2,2,2),
> G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
> G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
> G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
> G_1=c(1,2,0,0,1,1))
>
> # get list of inds in file (e.g. G_0,G_1,...,G_100)
> inds = grep("G_[0-9]+",names(data),perl=T,value=T)
>
> # get total number of inds
> nind = length(inds)
>
> # create an empty "confusion" table
> cmat = matrix(rep(0,9), nrow=3, ncol=3)
> colnames(cmat) = c("tru_rr", "tru_rv", "tru_vv")
> rownames(cmat) = c("call_rr","call_rv","call_vv")
>
> ## APPROACH 1: Nested For Loop ####
>
> # Nested Loop Approach
> for (row in (1:nrow(data))) {
> for (i in (0:(nind-1))) {
>
> Gmax = which.max(c( data[[paste("G_hat_0_",i,sep="")]][row],
> data[[paste("G_hat_1_",i,sep="")]][row],
> data[[paste("G_hat_2_",i,sep="")]][row] ))
>
> Gtru = data[[paste("G_",i,sep="")]][row] + 1 # add 1 to match Gmax
> range
>
> cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
> }
> }
>
>
> ## APPROACH 2: Nested lapply ####
>
> # This routine finds the geno w/ highest prob from the erg.avgs.
> # and compares it to the true geno. Result is tallied by
> # incrementing the appropriate index of the confusion matrix
>
> add2cmat <- function(ind,locus) {
>
> Gmax = which.max(c( data[[paste("G_hat_0_",ind,sep="")]][locus],
> data[[paste("G_hat_1_",ind,sep="")]][locus],
> data[[paste("G_hat_2_",ind,sep="")]][locus] ))
>
> Gtru = data[[paste("G_",ind,sep="")]][locus] + 1 # add 1 to match
> Gmax
> range
>
> cmat[Gmax,Gtru] <<- cmat[Gmax,Gtru] + 1 # use double arrow to
> modify
> global env.
>
> }
>
> # Run add2cmat for all individuals on a given locus
>
> add_locus2cmat <- function(locus) {
> lapply(0:(nind-1),add2cmat,locus)
> }
>
> junk = lapply((1:nrow(data)),add_locus2cmat) # don't need return
> value
>
>
>
> --
> View this message in context: http://r.789695.n4.nabble.com/Efficiency-Question-Nested-lapply-or-nested-for-loop-tp2968553p2968553.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list