[R] speeding up likelihood computation
Deepankar Basu
basu.15 at osu.edu
Mon Dec 3 18:37:36 CET 2007
Hi Jim,
Thanks for your suggestion. My guess is that most of the time is taken
by various kinds of assignments that I am making using loops; and these
can be done without loops. In a follow-up post, I will try to explain in
greater detail each part of my code (especially the parts where
assignments are being made) with comments.
Meanwhile I am copying the output from Rprof below. Any suggestions on
how to increase efficiency on the basis of this output?
> summaryRprof("example1.out")
$by.self
self.time self.pct total.time total.pct
fn 419.64 74.2 564.72 99.9
pnorm 28.16 5.0 35.02 6.2
matrix 25.58 4.5 213.26 37.7
sum 17.02 3.0 17.02 3.0
* 10.90 1.9 10.90 1.9
exp 7.76 1.4 7.76 1.4
factorial 7.38 1.3 11.16 2.0
^ 7.20 1.3 7.20 1.3
- 6.50 1.1 6.50 1.1
as.vector 5.12 0.9 196.52 34.8
vector 4.88 0.9 4.88 0.9
+ 3.66 0.6 3.66 0.6
( 3.42 0.6 3.42 0.6
> 2.78 0.5 2.78 0.5
gamma 2.68 0.5 2.68 0.5
numeric 2.42 0.4 7.30 1.3
& 2.36 0.4 2.36 0.4
/ 2.24 0.4 2.24 0.4
== 2.08 0.4 2.08 0.4
: 1.58 0.3 1.58 0.3
$ 1.10 0.2 1.10 0.2
dimnames<- 0.28 0.0 0.28 0.0
FUN 0.26 0.0 193.72 34.3
.Call 0.24 0.0 395.60 70.0
<Anonymous> 0.08 0.0 565.18 100.0
apply 0.04 0.0 193.50 34.2
paste 0.02 0.0 0.28 0.0
as.data.frame.matrix 0.02 0.0 0.02 0.0
is.null 0.02 0.0 0.02 0.0
genoud 0.00 0.0 565.42 100.0
t 0.00 0.0 193.50 34.2
optim 0.00 0.0 158.70 28.1
do.call 0.00 0.0 0.30 0.1
mfunc 0.00 0.0 0.30 0.1
f 0.00 0.0 0.28 0.0
lapply 0.00 0.0 0.26 0.0
as.data.frame 0.00 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
genoud 565.42 100.0 0.00 0.0
<Anonymous> 565.18 100.0 0.08 0.0
fn 564.72 99.9 419.64 74.2
.Call 395.60 70.0 0.24 0.0
matrix 213.26 37.7 25.58 4.5
as.vector 196.52 34.8 5.12 0.9
FUN 193.72 34.3 0.26 0.0
apply 193.50 34.2 0.04 0.0
t 193.50 34.2 0.00 0.0
optim 158.70 28.1 0.00 0.0
pnorm 35.02 6.2 28.16 5.0
sum 17.02 3.0 17.02 3.0
factorial 11.16 2.0 7.38 1.3
* 10.90 1.9 10.90 1.9
exp 7.76 1.4 7.76 1.4
numeric 7.30 1.3 2.42 0.4
^ 7.20 1.3 7.20 1.3
- 6.50 1.1 6.50 1.1
vector 4.88 0.9 4.88 0.9
+ 3.66 0.6 3.66 0.6
( 3.42 0.6 3.42 0.6
> 2.78 0.5 2.78 0.5
gamma 2.68 0.5 2.68 0.5
& 2.36 0.4 2.36 0.4
/ 2.24 0.4 2.24 0.4
== 2.08 0.4 2.08 0.4
: 1.58 0.3 1.58 0.3
$ 1.10 0.2 1.10 0.2
do.call 0.30 0.1 0.00 0.0
mfunc 0.30 0.1 0.00 0.0
dimnames<- 0.28 0.0 0.28 0.0
paste 0.28 0.0 0.02 0.0
f 0.28 0.0 0.00 0.0
lapply 0.26 0.0 0.00 0.0
as.data.frame.matrix 0.02 0.0 0.02 0.0
is.null 0.02 0.0 0.02 0.0
as.data.frame 0.02 0.0 0.00 0.0
$sampling.time
[1] 565.42
>
Deepankar
On Sun, 2007-12-02 at 20:54 -0500, jim holtman wrote:
> One thing that I would suggest that you do is to use Rprof on a subset
> of the data that runs for 10-15 minutes and see where some of the hot
> spots are. Since you have not provided commented, minimal,
> self-contained, reproducible code, it is hard to determine where the
> inefficiencies are since we don't have any data to run against it.
> Some of the loop look like you are just assigning a value to a vector,
> e.g.,
>
>
> > if (alive[j]==N1) {
> >
> > for (i in 1:(N1-1)) {
> > S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
> > }
> > }
> >
> > else {
> > for (i in 1:(N1-1)) {
> > S[N1,i] <- 0
> > }
> >
> > }
>
> that can be done without loops, but without data, it is hard to
> determine. Run Rprof and see what summary.Rprof shows to indicate
> where to focus on.
>
> On Dec 2, 2007 12:49 PM, DEEPANKAR BASU <basu.15 at osu.edu> wrote:
> > R Users:
> >
> > I am trying to estimate a model of fertility behaviour using birth history data with maximum likelihood. My code works but is extremely slow (because of several for loops and my programming inefficiencies); when I use the genetic algorithm to optimize the likelihood function, it takes several days to complete (on a machine with Intel Core 2 processor [2.66GHz] and 2.99 GB RAM). Computing the hessian and using it to calculate the standard errors takes a large chunk of this time.
> >
> > I am copying the code for my likelihood function below; it would be great if someone could suggest any method to speed up the code (by avoiding the for loops or by any other method).
> >
> > I am not providing details of my model or what exactly I am trying to do in each step of the computation below; i would be happy to provide these details if they are deemed necessary for re-working the code.
> >
> > Thanks.
> > Deepankar
> >
> >
> > --------- begin code -----------------------
> >
> > LLK1 <- function(paramets, data.frame, ...) { # DEFINING THE LOGLIKELIHOOD FUNCTION
> >
> > # paramets IS A 1x27 VECTOR OF PARAMETERS OVER WHICH THE FUNCTION WILL BE MAXIMISED
> > # data.frame IS A DATA FRAME. THE DATA FRAME CONTAINS OBSERVATIONS ON SEVERAL VARIABLES
> > # (LIKE EDUCATION, AGE, ETC.) FOR EACH RESPONDENT. COLUMNS REFER TO VARIABLES AND ROWS REFER
> > # TO OBSERVATIONS.
> >
> > ########## PARAMETERS ###############################
> >
> > # alpha: interaction between son targeting and family size
> > # beta : son targeting
> > # gamma : family size
> > # delta : a 1x6 vector of probabilities of male birth at various parities (q1, q2, q3, q4, q5, q6)
> > # zeta : a 1x11 vector of conditional probabilities with zeta[1]=1 always
> >
> > alpha <- paramets[1] # FIRST PARAMETER
> > beta <- paramets[2:9] # SECOND TO SEVENTH PARAMETER
> > gamma <- paramets[10:16]
> > delta <- paramets[17]
> > zeta <- paramets[18:27] # LAST 10 PARAMETERS
> >
> > ################# VARIABLES ###############################
> > # READING IN THE VARIABLES IN THE DATA FRAME
> > # AND RENAMING THEM FOR USE IN THE LIKELIHOOD FUNCTION
> >
> > everborn <- data.frame$v201
> > alive <- data.frame$alive
> > age <- data.frame$age
> > edu <- data.frame$edu
> > rural <- data.frame$rur
> > rich <- data.frame$rich
> > middle <- data.frame$middle
> > poor <- data.frame$poor
> > work <- data.frame$work
> > jointfam <- data.frame$jfam
> > contracep <- data.frame$contra
> > hindu <- data.frame$hindu
> > muslim <- data.frame$muslim
> > scaste <- data.frame$scaste
> > stribe <- data.frame$stribe
> > obc <- data.frame$obc
> > ucaste <- data.frame$ucaste
> > N <- data.frame$dfsize
> > indN <- data.frame$dfsize1 # INDICATOR FUNCTION THAT dfsize==alive
> > nb <- data.frame$nboy
> > ng <- data.frame$ngirl
> > ncord1 <- data.frame$ncord1 # FIRST CHILD: BOY=0; GIRL=1
> > ncord2 <- data.frame$ncord2 #SECOND CHILD: BOY=0; GIRL=1
> > ncord3 <- data.frame$ncord3
> > ncord4 <- data.frame$ncord4
> > ncord5 <- data.frame$ncord5
> > ncord6 <- data.frame$ncord6 # SIXTH CHILD: BOY=0; GIRL=1
> >
> >
> >
> > ######### POSITION OF i^th BOY ################################################
> > boy1 <- data.frame$boy1 # BIRTH POSITION OF FIRST BOY (ZERO IF THE FAMILY HAS NO BOYS)
> > boy2 <- data.frame$boy2 # BIRTH POSITION OF SECOND BOY (ZERO IF THE FAMILY HAS ONLY ONE BOY)
> > boy3 <- data.frame$boy3
> > boy4 <- data.frame$boy4
> > boy5 <- data.frame$boy5
> > boy6 <- data.frame$boy6 # BIRTH POSITION OF SIXTH BOY (ZERO IF THE FAMILY HAS ONLY FIVE BOYS)
> >
> >
> > ######################## CONDITIONAL PROBABILITIES ##########################
> > qq21 <- 1
> >
> > qq31 <- 1/(1+exp(zeta[1]))
> > qq32 <- exp(zeta[1])/(1+exp(zeta[1]))
> >
> > qq41 <- 1/(1+exp(zeta[2])+exp(zeta[3]))
> > qq42 <- exp(zeta[2])/(1+exp(zeta[2])+exp(zeta[3]))
> > qq43 <- exp(zeta[3])/(1+exp(zeta[2])+exp(zeta[3]))
> >
> > qq51 <- 1/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> > qq52 <- exp(zeta[4])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> > qq53 <- exp(zeta[5])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> > qq54 <- exp(zeta[6])/(1+exp(zeta[4])+exp(zeta[5])+exp(zeta[6]))
> >
> > qq61 <- 1/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq62 <- exp(zeta[7])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq63 <- exp(zeta[8])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq64 <- exp(zeta[9])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> > qq65 <- exp(zeta[10])/(1+exp(zeta[7])+exp(zeta[8])+exp(zeta[9])+exp(zeta[10]))
> >
> > zeta1 <- c(qq21,qq31,qq32,qq41,qq42,qq43,qq51,qq52,qq53,qq54,qq61,qq62,qq63,qq64,qq65)
> >
> > #########################################################################
> >
> > n <- length(N) # LENGTH OF SAMPLE; SIZE OF THE MAIN LOOP
> >
> > lglik <- numeric(n) # INITIALIZING THE LOGLIKELIHOOD FUNCTION
> > # CREATES A 1xn VECTOR OF ZEROS
> >
> > for (j in 1:n) { # START OF MAIN LOOP
> >
> > S <- matrix(0, 6, 6) # CREATE A 6x6 MATRIX OF ZEROS
> > y <- numeric(15) # CREATE A 1x15 VECTOR OF ZEROS
> > N1 <- N[j] # DESIRED FAMILY SIZE
> >
> >
> > q <- 1/(1+exp(delta)) # PROBABILITY OF MALE BIRTH
> >
> >
> > if (alive[j]==N1) {
> >
> > for (i in 1:(N1-1)) {
> > S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
> > }
> > }
> >
> > else {
> > for (i in 1:(N1-1)) {
> > S[N1,i] <- 0
> > }
> >
> > }
> >
> > ######### CREATE A 1x6 VECTOR WITH POSITION OF BOYS WITHIN FAMILY
> > x <- c(boy1[j], boy2[j], boy3[j], boy4[j], boy5[j])
> >
> > if (N1>1) {
> > for (i in 1:(N1-1)) {
> > if (alive[j]>x[i] & x[i]>0) {
> > S[N1,i] <- 0
> > }
> > if (x[i] == alive[j] ) {
> > S[N1,i] <- (q^(nb[j]))*((1-q)^(ng[j]))
> > }
> > }
> > }
> >
> > y <- c(S[2,1],S[3,1],S[3,2],S[4,1],S[4,2],S[4,3],S[5,1],S[5,2],S[5,3],S[5,4],S[6,1],S[6,2],S[6,3],S[6,4],S[6,5])
> >
> >
> > z1 <- c(age[j],edu[j],work[j],rural[j],poor[j],middle[j],hindu[j]) # DETERMINANTS OF FAMILY SIZE
> > z2 <- c(1,age[j],edu[j],work[j],contracep[j],rural[j],jointfam[j],hindu[j]) # DETERMINANTS OF SON TARGETING
> >
> > t1 <- (indN[j])*((q^(nb[j]))*((1-q)^(ng[j])))*(exp(-exp(sum(z1*gamma)))*((exp(sum(z1*gamma)))^N1)*pnorm(-sum(z2*beta)))/factorial(N1)
> > t2 <- (sum(y*zeta1))*(exp(-exp(sum(z1*gamma) + alpha))*((exp(sum(z1*gamma) + alpha))^N1)*(1-pnorm(-sum(z2*beta)))/factorial(N1))
> > lglik[j] <- log(t1+t2)
> > }
> >
> > return(-sum(lglik)) # RETURNING THE NEGATIVE OF THE LOGLIKELIHOOD
> > # SUMMED OVER ALL OBSERVATIONS
> >
> >
> > }
> >
> > ------------ end code ----------------------
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
More information about the R-help
mailing list