[R-sig-ME] parallel MCMCglmm, RNGstreams, starting values & priors

Jarrod Hadfield j.hadfield at ed.ac.uk
Tue Aug 26 13:04:34 CEST 2014


Hi Ruben,

There are 400 liabilities in a zapoisson model (2 per datum). This  
code should work:

g <- sample(letters[1:10], size = 200, replace = TRUE)
pred <- rnorm(200)

l1 <- rnorm(200, -1, sqrt(1))
l2 <- rnorm(200, -1, sqrt(1))

y <- VGAM::rzapois(200, exp(l1), exp(-exp(l2)))

# generate zero-altered data with an intercept of -1 (because the
# intercept and variance are the same for both processes this is just
# standard Poisson)

dat <- data.frame(y = y, g = g, pred = pred)


start.1 <- list(liab = c(l1, l2), R = list(R1 = diag(2)), G = list(G1 = diag(2)))
prior.1 <- list(R = list(V = diag(2), nu = 1.002, fix = 2),
                G = list(G1 = list(V = diag(2), nu = 2, alpha.mu = c(0, 0),
                                   alpha.V = diag(2) * 1000)))

m1 <- MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
               family = "zapoisson", rcov = ~idh(trait):units, data = dat,
               prior = prior.1, start = start.1)
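
If the aim is over-dispersed chains for multiple-chain convergence
diagnostics (discussed further down the thread), a second chain could be
started from deliberately different values. A sketch reusing the objects
above (the names start.2/m2 and the amount of dispersion are arbitrary):

start.2 <- list(liab = rnorm(400), R = list(R1 = diag(2)), G = list(G1 = diag(2) * 2))

m2 <- MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
               family = "zapoisson", rcov = ~idh(trait):units, data = dat,
               prior = prior.1, start = start.2)

# started from different values but (hopefully) sampling the same posterior
gelman.diag(mcmc.list(m1$Sol, m2$Sol))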

However, there are two bugs in the current version of MCMCglmm that
cause an error in cases where the documentation implies the call should
be fine:

a) it should be possible to have R=diag(2) rather than R =  
list(R1=diag(2)). This bug cropped up when I implemented  
block-diagonal R structures. It can be fixed by inserting:

           if(!is.list(start$R)){
              start$R<-list(R1=start$R)
           }

at line 514 of MCMCglmm.R, immediately below the existing lines:

           if(!is.list(prior$R[[1]])){
              prior$R<-list(R1=prior$R)
           }
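
With that patch in place, the simpler non-list form implied by the
documentation should also be accepted. A sketch reusing the objects above
(start.1b/m1b are arbitrary names):

start.1b <- list(liab = c(l1, l2), R = diag(2), G = list(G1 = diag(2)))
m1b <- MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
                family = "zapoisson", rcov = ~idh(trait):units, data = dat,
                prior = prior.1, start = start.1b)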

b) zi/za models with rcov=~trait:units will return an error when
starting values are passed. To fix this, insert:

          if(diagR==3){
            if(dim(start)[1]!=1){
              stop("V is the wrong dimension for some start$G/start$R elements")
            }
            start<-diag(sum(nfl))*start[1]
          }

at line 90 of priorformat.R, immediately below the existing lines:

          if(is.matrix(start)==FALSE){
            start<-as.matrix(start)
          }
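
Once that is in, a zi/za model with a single residual variance shared
across both processes should also accept starting values. An untested
sketch (prior.3/start.3/m3 and the exact prior are only illustrative):

prior.3 <- list(R = list(V = 1, nu = 0.002),
                G = list(G1 = list(V = diag(2), nu = 2, alpha.mu = c(0, 0),
                                   alpha.V = diag(2) * 1000)))
start.3 <- list(liab = c(l1, l2), R = 1, G = list(G1 = diag(2)))

m3 <- MCMCglmm(y ~ trait + pred:trait, random = ~us(trait):g,
               family = "zapoisson", rcov = ~trait:units, data = dat,
               prior = prior.3, start = start.3)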

I will put these in the new version.

Cheers,

Jarrod







Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
21:52:30 +0200:

> Hi Jarrod,
>
> thanks for these pointers.
>
>>> You will need to provide over-dispersed starting values for  
>>> multiple-chain convergence diagnostics to be useful (GLMM are so  
>>> simple I am generally happy if the output of a single run looks  
>>> reasonable).
>
> Oh, I would be happy with single chains, but since computation would
> take weeks that way, I wanted to parallelise and use multi-chain
> convergence diagnostics as a check that my parallelisation was done
> properly and is as informative as a single long chain. There don't seem
> to be any such checks built in – I was analysing my 40 chains for a bit
> longer than I like to admit before noticing they were identical
> (effectiveSize and summary.mcmc.list did not complain about this).
>
>>> # use some very bad starting values
> I get that these values are bad, but being over-dispersed is exactly
> the point for my multi-chain purpose, right?
>
> I can apply this to my zero-truncated model, but am again getting  
> stuck with the zero-altered one.
> Maybe I need only specify the Liab values for this?
> At least I'm getting nowhere with specifying R and G starting values  
> here. When I got an error, I always
> went to the MCMCglmm source to understand why the checks failed, but  
> I didn't always understand
> what was being checked and couldn't get it to work.
>
> Here's a failing example:
>
> l<-rnorm(200, -1, sqrt(1))
> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
> g = sample(letters[1:10], size = 200, replace = T)
> pred = rnorm(200)
> y<-rpois(200,exp(l)-t)
> y[1:40] = 0
> # generate zero-altered data with an intercept of -1
>
> dat<-data.frame(y=y, g = g, pred = pred)
> set.seed(1)
> start_true = list(Liab=l, R = 1, G = 1 )
> m1<-MCMCglmm(y~1 + pred,random = ~ g,  
> family="zapoisson",rcov=~us(trait):units, data=dat, start= start_true)
>
> # use true latent variable as starting values
> set.seed(1)
> # use some very bad starting values
> start_rand = list(Liab=rnorm(200), R = rpois(1, 1)+1, G = rpois(1, 1)+1 )
> m2<-MCMCglmm(y~1 + pred,random = ~ g,rcov=~us(trait):units,   
> family="zapoisson", data=dat, start = start_rand)
>
> Best,
>
> Ruben
>
> On 25 Aug 2014, at 18:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>
>> Hi Ruben,
>>
>> Sorry  - I was wrong when I said that everything is Gibbs sampled  
>> conditional on the latent variables. The location effects (fixed  
>> and random effects) are also sampled conditional on the  
>> (co)variance components so you should add them to the starting  
>> values. In the case where the true values are used:
>>
>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l,R=1))
>>
>> Cheers,
>>
>> Jarrod
>>
>>
>>
>> Quoting Jarrod Hadfield <j.hadfield at ed.ac.uk> on Mon, 25 Aug 2014  
>> 17:14:14 +0100:
>>
>>> Hi Ruben,
>>>
>>> You will need to provide over-dispersed starting values for  
>>> multiple-chain convergence diagnostics to be useful (GLMM are so  
>>> simple I am generally happy if the output of a single run looks  
>>> reasonable).
>>>
>>> With non-Gaussian data everything is Gibbs sampled conditional on  
>>> the latent variables, so you only need to pass them:
>>>
>>> l<-rnorm(200, -1, sqrt(1))
>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>> y<-rpois(200,exp(l)-t)+1
>>> # generate zero-truncated data with an intercept of -1
>>>
>>> dat<-data.frame(y=y)
>>> set.seed(1)
>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l))
>>> # use true latent variable as starting values
>>> set.seed(1)
>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat,  
>>> start=list(Liab=rnorm(200)))
>>> # use some very bad starting values
>>>
>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>> # not identical despite the same seed because of different  
>>> starting values but clearly sampling the same posterior  
>>> distribution:
>>>
>>> gelman.diag(mcmc.list(m1$Sol, m2$Sol))
>>>
>>> Cheers,
>>>
>>> Jarrod
>>>
>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
>>> 18:00:08 +0200:
>>>
>>>> Dear Jarrod,
>>>>
>>>> thanks for the quick reply. Please don't waste time looking into
>>>> doMPI – I am happy that I get the expected result when I specify that
>>>> reproducible seed, whatever the reason may be.
>>>> I'm pretty sure that is the deciding factor, because I tested it
>>>> explicitly; I just have no idea how or why it interacts with the
>>>> choice of family.
>>>>
>>>> That said, is setting up different RNG streams for my workers  
>>>> (now that it works) __sufficient__
>>>> so that I get independent chains and can use gelman.diag() for  
>>>> convergence diagnostics?
>>>> Or should I still tinker with the starting values myself?
>>>> I've never found a worked example of supplying starting values  
>>>> and am thus a bit lost.
>>>>
>>>> Sorry for sending further questions, I hope someone else takes pity while
>>>> you're busy with lectures.
>>>>
>>>> Best wishes
>>>>
>>>> Ruben
>>>>
>>>>
>>>>
>>>> On 25 Aug 2014, at 17:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>>>>
>>>>> Hi Ruben,
>>>>>
>>>>> I do not think the issue is with the starting values, because  
>>>>> even if the same starting values were used the chains would  
>>>>> still differ because of the randomness in the Markov Chain (if I  
>>>>> interpret your `identical' test correctly). This just involves a  
>>>>> call to GetRNGstate() in the C++ code (L 871 of MCMCglmm.cc) so I  
>>>>> think for some reason doMPI/foreach is not doing what you  
>>>>> expect. I am not familiar with doMPI and am in the middle of  
>>>>> writing lectures so haven't got time to look into it carefully.  
>>>>> Outside of the context of doMPI I get the behaviour I expect:
>>>>>
>>>>>
>>>>> l<-rnorm(200, -1, sqrt(1))
>>>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>>>> y<-rpois(200,exp(l)-t)+1
>>>>> # generate zero-truncated data with an intercept of -1
>>>>>
>>>>> dat<-data.frame(y=y)
>>>>> set.seed(1)
>>>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>> set.seed(2)
>>>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>> set.seed(2)
>>>>> m3<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>>
>>>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>>>> # different, as expected
>>>>> plot(mcmc.list(m2$Sol, m3$Sol))
>>>>> # the same, as expected
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug 2014  
>>>>> 16:58:06 +0200:
>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> sorry for bumping my old post, I hope to elicit a response with  
>>>>>> a more focused question:
>>>>>>
>>>>>> When does MCMCglmm automatically start from different values  
>>>>>> when using doMPI/foreach?
>>>>>>
>>>>>> I have done some tests with models of varying complexity. For  
>>>>>> example, the script in my last
>>>>>> post (using "zapoisson") yielded 40 identical chains:
>>>>>>> identical(mcmclist[1], mcmclist[30])
>>>>>> TRUE
>>>>>>
>>>>>> A simpler (?) model (using "ztpoisson" and no specified prior),  
>>>>>> however, yielded different chains
>>>>>> and I could use them to calculate gelman.diag()
>>>>>>
>>>>>> Changing my script to the version below, i.e. seeding foreach  
>>>>>> using .options.mpi=list( seed= 1337)
>>>>>> so as to make RNGstreams reproducible (or so I  thought), led  
>>>>>> to different chains even for the
>>>>>> "zapoisson" model.
>>>>>>
>>>>>> In no case have I (successfully) tried to supplant the default  
>>>>>> of MCMCglmm's "start" argument.
>>>>>> Is starting my models from different RNG substreams inadequate  
>>>>>> compared to manipulating
>>>>>> the start argument explicitly? If so, is there any worked  
>>>>>> example of explicit starting value manipulation
>>>>>> in parallel computation?
>>>>>> I've browsed the MCMCglmm source to understand how the default  
>>>>>> starting values are generated,
>>>>>> but didn't find any differences with respect to RNG for the two  
>>>>>> families "ztpoisson" and "zapoisson"
>>>>>> (granted, I did not dig very deep).
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Ruben Arslan
>>>>>>
>>>>>>
>>>>>> # bsub -q mpi -W 12:00 -n 41 -R np20 mpirun -H localhost -n 41  
>>>>>> R --slave -f  
>>>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_children_parallel.R"
>>>>>>
>>>>>> library(doMPI)
>>>>>> cl <-  
>>>>>> startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/rpqa_main/")
>>>>>> registerDoMPI(cl)
>>>>>> Children_mcmc1 = foreach(i=1:clusterSize(cl),.options.mpi =  
>>>>>> list(seed=1337) ) %dopar% {
>>>>>> 	library(MCMCglmm);library(data.table)
>>>>>> 	load("/usr/users/rarslan/rpqa/rpqa1.rdata")
>>>>>>
>>>>>> 	nitt = 130000; thin = 100; burnin = 30000
>>>>>> 	prior.m5d.2 = list(
>>>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>>> 	)
>>>>>>
>>>>>> 	rpqa.1 = na.omit(rpqa.1[spouses>0, list(idParents, children,  
>>>>>> male, urban, spouses, paternalage.mean, paternalage.factor)])
>>>>>> 	(m1 = MCMCglmm( children ~ trait * (male + urban + spouses +  
>>>>>> paternalage.mean + paternalage.factor),
>>>>>> 						rcov=~us(trait):units,
>>>>>> 						random=~us(trait):idParents,
>>>>>> 						family="zapoisson",
>>>>>> 						prior = prior.m5d.2,
>>>>>> 						data=rpqa.1,
>>>>>> 						pr = F, saveX = F, saveZ = F,
>>>>>> 						nitt=nitt,thin=thin,burnin=burnin))
>>>>>> }
>>>>>>
>>>>>> library(coda)
>>>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>>> save(Children_mcmc1,mcmclist, file =  
>>>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_mcmc_kids_za.rdata")
>>>>>> closeCluster(cl)
>>>>>> mpi.quit()
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 04 Aug 2014, at 20:25, Ruben Arslan <rubenarslan at gmail.com> wrote:
>>>>>>
>>>>>>> Dear list,
>>>>>>>
>>>>>>> would someone be willing to share her or his efforts in  
>>>>>>> parallelising a MCMCglmm analysis?
>>>>>>>
>>>>>>> I had something viable using harvestr that seemed to properly  
>>>>>>> initialise
>>>>>>> the starting values from different random number streams  
>>>>>>> (which is desirable,
>>>>>>> as far as I could find out), but I ended up being unable to  
>>>>>>> use harvestr, because
>>>>>>> it uses an old version of plyr, where parallelisation works  
>>>>>>> only for multicore, not for
>>>>>>> MPI.
>>>>>>>
>>>>>>> I pasted, at the end of this email, my working version, which does
>>>>>>> not do anything about starting values or RNG. I can try to fumble
>>>>>>> further in the dark or try to update harvestr, but maybe someone
>>>>>>> has gone through all this already.
>>>>>>>
>>>>>>> I'd also appreciate any tips for elegantly post-processing  
>>>>>>> such parallel data, as some of my usual
>>>>>>> extraction functions and routines are hampered by the fact  
>>>>>>> that some coda functions
>>>>>>> do not aggregate results over chains. (What I get from a  
>>>>>>> single-chain summary in MCMCglmm
>>>>>>> is a bit more comprehensive than what I managed to cobble  
>>>>>>> together with my own extraction
>>>>>>> functions).
>>>>>>>
>>>>>>> The reason I'm parallelising my analyses is that I'm having  
>>>>>>> trouble getting a good effective
>>>>>>> sample size for any parameter having to do with the many  
>>>>>>> zeroes in my data.
>>>>>>> Any pointers are very appreciated, I'm quite inexperienced  
>>>>>>> with MCMCglmm.
>>>>>>>
>>>>>>> Best wishes
>>>>>>>
>>>>>>> Ruben
>>>>>>>
>>>>>>> # bsub -q mpi-short -W 2:00 -n 42 -R np20 mpirun -H localhost  
>>>>>>> -n 41 R --slave -f "rpqa/rpqa_main/rpqa_children_parallel.r"
>>>>>>> library(doMPI)
>>>>>>> cl <- startMPIcluster()
>>>>>>> registerDoMPI(cl)
>>>>>>> Children_mcmc1 = foreach(i=1:40) %dopar% {
>>>>>>> 	library(MCMCglmm)
>>>>>>> 	load("rpqa1.rdata")
>>>>>>>
>>>>>>> 	nitt = 40000; thin = 100; burnin = 10000
>>>>>>> 	prior = list(
>>>>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>>>> 	)
>>>>>>>
>>>>>>> 	MCMCglmm( children ~ trait -1 + at.level(trait,1):male +  
>>>>>>> at.level(trait,1):urban + at.level(trait,1):spouses +  
>>>>>>> at.level(trait,1):paternalage.mean +  
>>>>>>> at.level(trait,1):paternalage.factor,
>>>>>>> 		rcov=~us(trait):units,
>>>>>>> 		random=~us(trait):idParents,
>>>>>>> 		family="zapoisson",
>>>>>>> 		prior = prior,
>>>>>>> 		data=rpqa.1,
>>>>>>> 		pr = F, saveX = T, saveZ = T,
>>>>>>> 		nitt=nitt,thin=thin,burnin=burnin)
>>>>>>> }
>>>>>>>
>>>>>>> library(coda)
>>>>>>> mcmclist = mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>>>> save(Children_mcmc1,mcmclist, file = "rpqa_mcmc_kids_za.rdata")
>>>>>>> closeCluster(cl)
>>>>>>> mpi.quit()
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ruben C. Arslan
>>>>>>>
>>>>>>> Georg August University Göttingen
>>>>>>> Biological Personality Psychology and Psychological Assessment
>>>>>>> Georg Elias Müller Institute of Psychology
>>>>>>> Goßlerstr. 14
>>>>>>> 37073 Göttingen
>>>>>>> Germany
>>>>>>> Tel.: +49 551 3920704
>>>>>>> https://psych.uni-goettingen.de/en/biopers/team/arslan
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> R-sig-mixed-models at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>>>
>>
>>
>>
>>
>>
>
>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


