[R] Timing benefits of mapply() vs. for loop (was: Wrap a loop inside a function)
Frank E Harrell Jr
f.harrell at vanderbilt.edu
Thu Jul 20 15:13:02 CEST 2006
Gabor Grothendieck wrote:
> Note that if you use mapply in the way I suggested, which is not
> the same as in your post, then it's just as fast. (Also, the version
> of mapply in your post gives different numerical results from
> the for loop, whereas mine gives the same.) like.mat is the for
> loop version, like.mat2 is your mapply version, and like.mat3
> is my mapply version.
You might also test mApply in Hmisc.
Cheers
Frank
>
>> like.mat <- function(score, items, theta){
> + like.mat <- matrix(numeric(length(items) * length(theta)), ncol = length(theta))
> + for(i in 1:length(items)) like.mat[i, ] <- pcm(theta, items[[i]], score[[i]])
> + like.mat
> + }
>> system.time(for(i in 1:100) like.mat(score,items,theta))
> [1] 1.30 0.00 1.34 NA NA
>> like.mat2 <- function(score, items, theta)
> + matrix(mapply(pcm, rep(theta,length(items)), items, score),
> + ncol = length(theta), byrow = TRUE)
>> system.time(for(i in 1:100) like.mat2(score,items,theta))
> [1] 5.70 0.00 5.91 NA NA
>> all.equal(like.mat(score, items, theta), like.mat2(score, items, theta))
> [1] "Mean relative difference: 1.268095"
>> like.mat3 <- function(score, items, theta)
> + t(mapply(pcm, items, score, MoreArgs = list(theta = theta)))
>> system.time(for(i in 1:100) like.mat3(score,items,theta))
> [1] 1.32 0.01 1.39 NA NA
>> m3 <- like.mat3(score, items, theta)
>> dimnames(m3) <- NULL
>> all.equal(like.mat(score, items, theta), m3)
> [1] TRUE
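
For anyone puzzled by the numerical discrepancy above: mapply() recycles
its vector arguments element by element, so in like.mat2 each call to pcm
receives a single scalar from rep(theta, length(items)), paired with
whichever item the recycling happens to line up with. pcm is therefore
called length(theta) * length(items) times (slow) and its results fill
the matrix in a scrambled order (hence the "mean relative difference").
MoreArgs = list(theta = theta) instead hands the full theta vector to
every call, so mapply iterates only over items and score, matching the
for loop. A minimal sketch of the two calling patterns, using a
hypothetical toy function f:

  f <- function(a, b) paste(a, b, sep = ":")
  mapply(f, rep(1:2, 3), c("x", "y", "z"))
  # "1:x" "2:y" "1:z" "2:x" "1:y" "2:z"  -- element-wise recycling
  mapply(f, b = c("x", "y", "z"), MoreArgs = list(a = 1:2))
  # each call receives the whole vector 1:2 (result is a 2 x 3 matrix)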
>
> On 7/20/06, Doran, Harold <HDoran at air.org> wrote:
>>
>>
>> List:
>>
>> Thank you for the replies to my post yesterday. Gabor and Phil also gave
>> useful replies on how to improve the function by relying on mapply rather
>> than an explicit for loop. In general, I try to use the family of apply
>> functions rather than looping constructs such as for and while as a
>> matter of practice.
>>
>> However, in this case mapply seems to be slower (in CPU time) than the
>> for loop. Here is an example.
>>
>> # data needed for example
>>
>> items <- list(item1 = c(0,1,2), item2 = c(0,1), item3 = c(0,1,2,3,4),
>>               item4 = c(0,1), item5 = c(0,1,2,3,4), item6 = c(0,1,2,3))
>> score <- c(2,1,3,1,3,2)
>> theta <- c(-1,-.5,0,.5,1)
>>
>> # My old function using the for loop
>>
>> like.mat <- function(score, items, theta){
>>   like.mat <- matrix(numeric(length(items) * length(theta)),
>>                      ncol = length(theta))
>>   for(i in 1:length(items))
>>     like.mat[i, ] <- pcm(theta, items[[i]], score[[i]])
>>   like.mat
>> }
>>
>> system.time(like.mat(score,items,theta))
>> [1] 0 0 0 NA NA
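
(A single call here is too fast for system.time() to register, hence the
zeros above. One way to get a measurable figure is to repeat the call,
e.g. system.time(replicate(100, like.mat(score, items, theta))), which is
essentially what the timings at the top of this thread do with a for loop.)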
>>
>> # Revised using mapply
>>
>> like.mat <- function(score, items, theta){
>>   matrix(mapply(pcm, rep(theta, length(items)), items, score),
>>          ncol = length(theta), byrow = TRUE)
>> }
>>
>>> system.time(like.mat(score,items,theta))
>> [1] 0.03 0.00 0.03 NA NA
>>
>>
>> Using mapply is obviously slower, but only nominally. So let's look at
>> this within the context of the full program I am working on. For context,
>> I am evaluating an integral using Gaussian quadrature. This is a
>> psychometric problem where the function 'pcm' is Masters' partial credit
>> model and 'score' is the student's score on test item i. When an item has
>> two categories (0,1), pcm reduces to the Rasch model for dichotomous data.
>> The dnorm is set to N(0,1) by default; the parameters of the population
>> distribution are estimated in a separate procedure and are normally passed
>> into the function, but the default works for this example.
>>
>> Here is the full program.
>>
>> library(statmod)
>>
>> # Masters' partial credit model: probability that a student with
>> # ability theta responds in category 'score' of an item with step
>> # parameters d (vectorized over theta)
>> pcm <- function(theta, d, score){
>>   exp(rowSums(outer(theta, d[1:score], '-'))) /
>>     apply(exp(apply(outer(theta, d, '-'), 1, cumsum)), 2, sum)
>> }
>>
>> like.mat <- function(score, items, theta){
>>   like.mat <- matrix(numeric(length(items) * length(theta)),
>>                      ncol = length(theta))
>>   for(i in 1:length(items))
>>     like.mat[i, ] <- pcm(theta, items[[i]], score[[i]])
>>   like.mat
>> }
>>
>> # mapply version -- turned off for now
>> # like.mat <- function(score, items, theta){
>> #   matrix(mapply(pcm, rep(theta, length(items)), items, score),
>> #          ncol = length(theta), byrow = TRUE)
>> # }
>>
>> class.numer <- function(score, items, prof_cut, mu = 0, sigma = 1, aboveQ){
>>   gauss_numer <- gauss.quad(49, kind = "laguerre")
>>   if(aboveQ == FALSE){
>>     mat <- rbind(like.mat(score, items, prof_cut - gauss_numer$nodes),
>>                  dnorm(prof_cut - gauss_numer$nodes, mean = mu, sd = sigma))
>>   } else {
>>     mat <- rbind(like.mat(score, items, gauss_numer$nodes + prof_cut),
>>                  dnorm(gauss_numer$nodes + prof_cut, mean = mu, sd = sigma))
>>   }
>>   f_y <- rbind(apply(mat, 2, prod), exp(gauss_numer$nodes),
>>                gauss_numer$weights)
>>   sum(apply(f_y, 2, prod))
>> }
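
(The exp(gauss_numer$nodes) factor is the Gauss-Laguerre correction:
gauss.quad(49, kind = "laguerre") returns nodes x_i and weights w_i for
integrals of the form integral_0^Inf exp(-x) f(x) dx ~ sum_i w_i f(x_i),
so an integral of g(x) alone is approximated by
sum_i w_i exp(x_i) g(x_i). Shifting the nodes by prof_cut maps the
half-line onto (prof_cut, Inf) or (-Inf, prof_cut) according to aboveQ.)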
>>
>> class.denom <- function(score, items, mu = 0, sigma = 1){
>>   gauss_denom <- gauss.quad.prob(49, dist = 'normal', mu = mu, sigma = sigma)
>>   mat <- rbind(like.mat(score, items, gauss_denom$nodes),
>>                gauss_denom$weights)
>>   sum(apply(mat, 2, prod))
>> }
>>
>> class.acc <- function(score, items, prof_cut, mu = 0, sigma = 1,
>>                       aboveQ = TRUE){
>>   result <- class.numer(score, items, prof_cut, mu, sigma, aboveQ) /
>>     class.denom(score, items, mu, sigma)
>>   return(result)
>> }
>>
>> # Test the function "class.acc"
>> items <- list(item1 = c(0,1,2), item2 = c(0,1), item3 = c(0,1,2,3,4),
>>               item4 = c(0,1), item5 = c(0,1,2,3,4), item6 = c(0,1,2,3))
>> score <- c(2,1,3,1,3,2)
>>
>> # System time when like.mat is the for-loop version
>> system.time(class.acc(score,items,1,aboveQ=T))
>> [1] 0.04 0.00 0.04 NA NA
>>
>> # System time when like.mat is the mapply version (uncommented above)
>> system.time(class.acc(score,items,1,aboveQ=T))
>> [1] 0.70 0.00 0.73 NA NA
>>
>>
>> There is a substantial improvement in CPU seconds when the for loop is
>> used rather than the mapply function. I experimented with adding more
>> items and varying the number of quadrature points, and the for loop was
>> always faster.
>>
>> Given this result, I wonder what advice might be offered. Is there an
>> inherent reason (reliability, etc.) one might opt for mapply even when
>> it compromises computational speed? Or should computational speed be
>> considered ahead of the common advice to rely on the family of apply
>> functions instead of explicit loops?
>>
>> Thanks for your consideration of my question,
>> Harold