[R-sig-ME] parallel MCMCglmm, RNGstreams, starting values & priors

Jarrod Hadfield j.hadfield at ed.ac.uk
Thu Aug 28 20:59:57 CEST 2014


Hi,

The posteriors for the two models look pretty close to me. Are the  
scale reduction factors really as high as previously reported? Before  
you had 1.83 for traitza_children:spouses, but the plot suggests that  
it should be close to 1?
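If you want to double-check, something along these lines (assuming
mcmc_1 is the mcmc.list holding your chains) gives the scale
reduction factor and shrink plot for that one parameter on its own:

library(coda)
# PSRF and shrink-factor plot for the single suspect parameter
gelman.diag(mcmc_1[, "traitza_children:spouses"])
gelman.plot(mcmc_1[, "traitza_children:spouses"])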

Cheers,

Jarrod




Quoting Ruben Arslan <rubenarslan at gmail.com> on Thu, 28 Aug 2014  
19:59:16 +0200:

> Sure! Thanks a lot.
> I am already using ~idh(trait):units, sorry for stating that  
> incorrectly in my last email.
> These models aren't the final thing: I will replace the  
> paternalage.factor variable with its linear equivalent if that  
> seems defensible (it does so far), and in this model it seems  
> okay to drop the za-effects for all predictors except spouses.
> So a final model would have fewer fixed effects. I also have  
> datasets of 200k+ and 5m+, but I'm learning MCMCglmm on this  
> smaller one because my wrong turns take less time.
>
> I've uploaded a comparison coef plot of two models:
> http://i.imgur.com/sHUfnmd.png
> m7 uses the default starting values; m1 uses the specification  
> I sent in my last email. I don't know if such differences are  
> something to worry about.
>
> I don't know what qualifies as highly over-dispersed; here's a plot  
> of the outcome for ever-married people (slate = real data):
> http://imgur.com/14MywgZ
> and here's one with everybody born (incl. some stillborn etc.):
> http://imgur.com/knRGa1v
> I guess my approach (generating an over-dispersed Poisson with the  
> parameters from the data and checking whether it produces as many  
> zeroes) is not the best way to diagnose zero-inflation, but  
> especially in the second case it seems fairly clear-cut.
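>
> Roughly what I did, as a sketch (the negative binomial here is just  
> a stand-in for "over-dispersed Poisson", with size matched to the  
> observed mean and variance; not my exact code):
>
> m <- mean(krmh.1$children); v <- var(krmh.1$children)
> size <- m^2 / (v - m)                      # method-of-moments NB size
> sim <- rnbinom(nrow(krmh.1), mu = m, size = size)
> # compare the share of zeroes in the data with the simulated one
> c(observed = mean(krmh.1$children == 0), simulated = mean(sim == 0))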
>
> Best regards,
>
> Ruben
>
>> summary(m1)
>
>  Iterations = 50001:149951
>  Thinning interval  = 50
>  Sample size  = 2000
>
>  DIC: 31249.73
>
>  G-structure:  ~idh(trait):idParents
>
>                       post.mean  l-95% CI u-95% CI eff.samp
> children.idParents     0.006611 4.312e-08   0.0159    523.9
> za_children.idParents  0.193788 7.306e-02   0.3283    369.3
>
>  R-structure:  ~idh(trait):units
>
>                   post.mean l-95% CI u-95% CI eff.samp
> children.units       0.1285   0.1118   0.1452    716.1
> za_children.units    0.9950   0.9950   0.9950      0.0
>
>  Location effects: children ~ trait * (male + spouses +  
> paternalage.mean + paternalage.factor)
>
>                                              post.mean   l-95% CI   u-95% CI eff.samp  pMCMC
> (Intercept)                                  1.3413364  1.2402100  1.4326099     1789 <5e-04 ***
> traitza_children                            -0.8362879 -1.2007980 -0.5016730     1669 <5e-04 ***
> male                                         0.0994902  0.0679050  0.1297394     2000 <5e-04 ***
> spouses                                      0.1236033  0.0839000  0.1624939     2000 <5e-04 ***
> paternalage.mean                             0.0533892  0.0119569  0.0933960     2000  0.015 *
> paternalage.factor(25,30]                   -0.0275822 -0.1116421  0.0537359     1842  0.515
> paternalage.factor(30,35]                   -0.0691025 -0.1463214  0.0122393     1871  0.097 .
> paternalage.factor(35,40]                   -0.1419933 -0.2277379 -0.0574678     1845 <5e-04 ***
> paternalage.factor(40,45]                   -0.1364952 -0.2362714 -0.0451874     1835  0.007 **
> paternalage.factor(45,50]                   -0.1445342 -0.2591767 -0.0421178     1693  0.008 **
> paternalage.factor(50,55]                   -0.1302972 -0.2642965  0.0077061     2000  0.064 .
> paternalage.factor(55,90]                   -0.3407879 -0.5168972 -0.1493652     1810 <5e-04 ***
> traitza_children:male                        0.0926888 -0.0147379  0.2006142     1901  0.098 .
> traitza_children:spouses                     0.5531197  0.3870616  0.7314289     1495 <5e-04 ***
> traitza_children:paternalage.mean            0.0051463 -0.1279396  0.1460099     1617  0.960
> traitza_children:paternalage.factor(25,30] -0.1538957 -0.4445749  0.1462955     1781  0.321
> traitza_children:paternalage.factor(30,35] -0.1747883 -0.4757851  0.1162476     1998  0.261
> traitza_children:paternalage.factor(35,40] -0.2261843 -0.5464379  0.0892582     1755  0.166
> traitza_children:paternalage.factor(40,45] -0.2807543 -0.6079678  0.0650281     1721  0.100 .
> traitza_children:paternalage.factor(45,50] -0.4905843 -0.8649214 -0.1244174     1735  0.010 **
> traitza_children:paternalage.factor(50,55] -0.4648579 -0.9215759 -0.0002083     1687  0.054 .
> traitza_children:paternalage.factor(55,90] -0.3945406 -1.0230155  0.2481568     1793  0.195
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
>> describe(krmh.1[spouses>0,])
>                  vars    n mean   sd median trimmed  mad   min   max range skew kurtosis   se
> children            2 6829 3.81 2.93   4.00    3.61 2.97  0.00 16.00 16.00 0.47    -0.46 0.04
> male                3 6829 0.46 0.50   0.00    0.45 0.00  0.00  1.00  1.00 0.14    -1.98 0.01
> spouses             4 6829 1.14 0.38   1.00    1.03 0.00  1.00  4.00  3.00 2.87     8.23 0.00
> paternalage         5 6829 3.65 0.80   3.57    3.60 0.80  1.83  7.95  6.12 0.69     0.70 0.01
> paternalage_c       6 6829 0.00 0.80  -0.08   -0.05 0.80 -1.82  4.30  6.12 0.69     0.70 0.01
> paternalage.mean    7 6829 0.00 0.68  -0.08   -0.05 0.59 -1.74  4.30  6.04 0.95     1.97 0.01
> paternalage.diff    8 6829 0.00 0.42   0.00   -0.01 0.38 -1.51  1.48  2.99 0.17     0.17 0.01
>
>> table(krmh.1$paternalage.factor)
>
>  [0,25] (25,30] (30,35] (35,40] (40,45] (45,50] (50,55] (55,90]
>     309    1214    1683    1562    1039     623     269     130
>
> On 28 Aug 2014, at 19:05, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>
>> Hi Ruben,
>>
>> It might be hard to detect (near) ECPs with so many fixed effects  
>> (can you post the model summary, and give us the mean and standard  
>> deviation of any continuous covariates?). Also, the complementary  
>> log-log link (which is what the za part uses) is non-symmetric and  
>> runs into problems outside the range -35 to 3.5, so there may be a  
>> problem there, particularly if you use rcov=~trait:units and the  
>> Poisson part is highly over-dispersed. You could try  
>> rcov=~idh(trait):units and fix the non-identifiable za residual  
>> variance to something smaller than 1 (say 0.5) - it will mix  
>> slower but it will reduce the chance of over/underflow.
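>>
>> Such a prior could look something like this (just a sketch; the za  
>> residual is the second diagonal element, and fix=2 keeps it at the  
>> value given in V):
>>
>> prior.za <- list(
>>   R = list(V = diag(c(1, 0.5)), nu = 0.002, fix = 2),  # za residual variance fixed at 0.5
>>   G = list(G1 = list(V = diag(2), nu = 0.002))
>> )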
>>
>> Cheers,
>>
>> Jarrod
>>
>>
>>
>>
>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Thu, 28 Aug 2014  
>> 18:45:30 +0200:
>>
>>> Hi Jarrod,
>>>
>>>> 1) it did not return an error with rcov = ~trait:units because  
>>>> you used R1=rpois(2,1)+1 and yet this specification only fits a  
>>>> single variance (not a 2x2 covariance matrix). R1=rpois(2,1)+1 is  
>>>> a bit of a weird specification since it has to be integer. I  
>>>> would obtain starting values using rIW().
>>>
>>> I agree it's a weird specification; I was a bit lost and thought I  
>>> could get away with just putting some random numbers into the  
>>> starting values.
>>> I didn't do R1=rpois(2,1)+1 though, I did R1=diag(rpois(2,1)+1),  
>>> so I did get a 2x2 matrix, but yes, one bound to be integer.
>>> I didn't know starting values should come from a conjugate  
>>> distribution, though that probably means I didn't think about it  
>>> much.
>>>
>>> I'm now doing
>>> start <- list(
>>> 	liab=c(rnorm( nrow(krmh.1)*2 )),
>>> 	R = list(R1 = rIW( diag(2), nrow(krmh.1)) ),
>>> 	G = list(G1 = rIW( diag(2), nrow(krmh.1)) )
>>> )
>>>
>>> Is this what you had in mind?
>>> I am especially unsure whether I'm supposed to use such low  
>>> sampling variability (my sample size is probably not even relevant  
>>> for the starting values) and whether I should start from diag(2).
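>>>
>>> (If I understand the parameterisation right, a smaller nu gives  
>>> more dispersed draws, which is what I'd want for over-dispersed  
>>> starts, e.g.:)
>>>
>>> rIW(diag(2), 3)      # small nu: very variable draws
>>> rIW(diag(2), 10000)  # huge nu: draws pinned close to diag(2)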
>>>
>>> And I am still happily confused that this specification doesn't  
>>> lead to errors with respect to rcov = ~trait:units. Does this mean  
>>> I'm doing it wrong?
>>>
>>>> 3) a) how many effective samples do you have for each parameter?  
>>>> and b) are you getting extreme category problems/numerical  
>>>> issues? If you store the latent variables (pl=TRUE) what is their  
>>>> range for the zi/za part?
>>>
>>> My parallel run using the above starting values isn't finished yet.
>>> a) After applying the above starting values I get, for the location  
>>> effects, 1600-2000 effective samples from a 2000-sample chain (with  
>>> thin set to 50). The G- and R-structure terms range from 369  
>>> (za_children.idParents) to 716 (and 0 for the fixed za residual).
>>> Effective sample sizes were similar for the run that used the G/R  
>>> starting values drawn from rpois, and with 40 chains I of course  
>>> get correspondingly more in total.
>>> b) I don't think I am getting extreme categories. I probably would  
>>> be if I included the forever-alones (they almost never have  
>>> children), but not this way.
>>> I wasn't sure how to examine the range of the latents separately  
>>> for the za part, but for a single chain it looks okay:
>>>> quantile(as.numeric(m1$Liab),probs = c(0,0.01,0,0.99,1))
>>>       0%        1%        0%       99%      100%
>>> -4.934111 -1.290728 -4.934111  3.389847  7.484206
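>>>
>>> If the latent columns are stacked trait by trait (Poisson first,  
>>> then za, as in liab = c(l1, l2) in your example), then something  
>>> like this should isolate the za part (a sketch):
>>>
>>> n <- nrow(krmh.1)
>>> za_liab <- m1$Liab[, (n + 1):(2 * n)]  # second block of columns = za latents
>>> quantile(as.numeric(za_liab), probs = c(0, 0.01, 0.5, 0.99, 1))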
>>>
>>> All considered, now that I use the above starting value  
>>> specification I get slightly different estimates for all  
>>> za-coefficients. Nothing major, but it still leads me to think my  
>>> estimates aren't exactly independent of the starting values I use.  
>>> I'll see what the parallel run yields.
>>>
>>> Thanks a lot,
>>>
>>> Ruben
>>>
>>>>
>>>> Cheers,
>>>>
>>>> Jarrod
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Wed, 27 Aug 2014  
>>>> 19:23:42 +0200:
>>>>
>>>>> Hi Jarrod,
>>>>>
>>>>> thanks again. I was able to get it running with your advice.
>>>>> Some points of confusion remain:
>>>>>
>>>>> - You wrote that zi/za models would return an error with rcov =  
>>>>> ~trait:units + starting values. This did not happen in my case,  
>>>>> so I didn't build MCMCglmm myself with your suggested edits.  
>>>>> Also, have you considered putting your own MCMCglmm repo on  
>>>>> GitHub? Your users would be able to install pre-releases, and I'd  
>>>>> think you'd get some time-saving pull requests too.
>>>>> - In my attempts to get my models to run properly, I messed up a  
>>>>> prior and did not use fix=2 in my prior specification for my za  
>>>>> models. This led to crappy convergence; it's much better now, and  
>>>>> for some of my simpler models I think I won't need parallel  
>>>>> chains. I'm reminded of Gelman's folk theorem of statistical  
>>>>> computing.
>>>>> - I followed your advice, but of course I could not set the true  
>>>>> values as starting values, so I wanted to set random, bad  
>>>>> starting values instead. I pasted below what I arrived at; I'm  
>>>>> especially unsure whether I specified the starting values for G  
>>>>> and R properly (I think not).
>>>>> 	start <- list(
>>>>> 		liab=c(rnorm( nrow(krmh.1)*2 )),
>>>>> 		R = list(R1 = diag(rpois(2, 1)+1)),
>>>>> 		G = list(G1 = diag(rpois(2, 1)+1))
>>>>> 	)
>>>>>
>>>>>
>>>>> However, even though I may not need multiple chains for some of  
>>>>> my simpler models, I've now run into conflicting diagnostics.  
>>>>> geweke.diag for each chain (and examination of the traces) gives  
>>>>> satisfactory results. Comparing multiple chains using  
>>>>> gelman.diag, however, flags one bad guy, namely the  
>>>>> traitza_children:spouses interaction.
>>>>> I think this implies that I've got some starting-value dependence  
>>>>> for this parameter that won't be easily rectified by a longer  
>>>>> burn-in? Do you have any ideas how to fix this?
>>>>> I am currently doing sequential analyses on episodes of  
>>>>> selection, and in historical human data only those who marry have  
>>>>> a chance at having kids. I exclude the unmarried from the sample  
>>>>> in which I predict number of children, because I examine that in  
>>>>> a previous model and the zero-inflation when including them is so  
>>>>> excessive (65% zeros; median without the unmarried = 4).
>>>>> Number of spouses is easily the strongest predictor in the model,  
>>>>> but only serves as a covariate here. Since my other estimates are  
>>>>> stable across chains and runs and agree well across models and  
>>>>> with theory, I'm inclined to shrug this off. But probably I  
>>>>> shouldn't ignore this sign of non-convergence?
>>>>>
>>>>>> gelman.diag(mcmc_1)
>>>>> Potential scale reduction factors:
>>>>>
>>>>>                                          Point est. Upper C.I.
>>>>> (Intercept)                                      1.00       1.00
>>>>> traitza_children                                 1.27       1.39
>>>>> male                                             1.00       1.00
>>>>> spouses                                          1.00       1.00
>>>>> paternalage.mean                                 1.00       1.00
>>>>> paternalage.factor(25,30]                        1.00       1.00
>>>>> paternalage.factor(30,35]                        1.00       1.00
>>>>> paternalage.factor(35,40]                        1.00       1.00
>>>>> paternalage.factor(40,45]                        1.00       1.00
>>>>> paternalage.factor(45,50]                        1.00       1.00
>>>>> paternalage.factor(50,55]                        1.00       1.00
>>>>> paternalage.factor(55,90]                        1.00       1.00
>>>>> traitza_children:male                            1.22       1.32
>>>>> traitza_children:spouses                         1.83       2.13
>>>>> traitza_children:paternalage.mean                1.02       1.02
>>>>> traitza_children:paternalage.factor(25,30]       1.03       1.05
>>>>> traitza_children:paternalage.factor(30,35]       1.05       1.08
>>>>> traitza_children:paternalage.factor(35,40]       1.10       1.15
>>>>> traitza_children:paternalage.factor(40,45]       1.12       1.17
>>>>> traitza_children:paternalage.factor(45,50]       1.19       1.28
>>>>> traitza_children:paternalage.factor(50,55]       1.12       1.18
>>>>> traitza_children:paternalage.factor(55,90]       1.11       1.17
>>>>>
>>>>> Multivariate psrf
>>>>>
>>>>> 7.27
>>>>>
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Ruben
>>>>>
>>>>>
>>>>> On 26 Aug 2014, at 13:04, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>>>>>
>>>>>> Hi Ruben,
>>>>>>
>>>>>> There are 400 liabilities in a zapoisson model (2 per datum).  
>>>>>> This code should work:
>>>>>>
>>>>>> g <-sample(letters[1:10], size = 200, replace = T)
>>>>>> pred <- rnorm(200)
>>>>>>
>>>>>> l1<-rnorm(200, -1, sqrt(1))
>>>>>> l2<-rnorm(200, -1, sqrt(1))
>>>>>>
>>>>>> y<-VGAM::rzapois(200, exp(l1), exp(-exp(l2)))
>>>>>>
>>>>>> # generate zero-altered data with an intercept of -1 (because the
>>>>>> # intercept and variance are the same for both processes, this is
>>>>>> # just standard Poisson)
>>>>>>
>>>>>> dat<-data.frame(y=y, g = g, pred = pred)
>>>>>>
>>>>>>
>>>>>> start.1<-list(liab=c(l1,l2), R = list(R1=diag(2)), G=list(G1=diag(2)))
>>>>>> prior.1<-list(R=list(V=diag(2), nu=1.002, fix=2),  
>>>>>> G=list(G1=list(V=diag(2), nu=2, alpha.mu=c(0,0),  
>>>>>> alpha.V=diag(2)*1000)))
>>>>>>
>>>>>> m1<-MCMCglmm(y~trait + pred:trait, random=~us(trait):g,  
>>>>>> family="zapoisson",rcov=~idh(trait):units, data=dat,  
>>>>>> prior=prior.1, start= start.1)
>>>>>>
>>>>>> However, there are 2 bugs in the current version of MCMCglmm  
>>>>>> that return an error message when the documentation implies it  
>>>>>> should be fine:
>>>>>>
>>>>>> a) it should be possible to have R=diag(2) rather than R =  
>>>>>> list(R1=diag(2)). This bug cropped up when I implemented  
>>>>>> block-diagonal R structures. It can be fixed by inserting:
>>>>>>
>>>>>>        if(!is.list(start$R)){
>>>>>>           start$R<-list(R1=start$R)
>>>>>>        }
>>>>>>
>>>>>> on L514 of MCMCglmm.R below
>>>>>>
>>>>>>        if(!is.list(prior$R[[1]])){
>>>>>>           prior$R<-list(R1=prior$R)
>>>>>>        }
>>>>>>
>>>>>> b) rcov=~trait:units models for zi/za models will return an  
>>>>>> error when passing starting values. To fix this insert
>>>>>>
>>>>>>       if(diagR==3){
>>>>>>         if(dim(start)[1]!=1){
>>>>>>           stop("V is the wrong dimension for some start$G/start$R elements")
>>>>>>         }
>>>>>>         start<-diag(sum(nfl))*start[1]
>>>>>>       }
>>>>>>
>>>>>> on L90 of priorfromat.R below
>>>>>>
>>>>>>       if(is.matrix(start)==FALSE){
>>>>>>         start<-as.matrix(start)
>>>>>>       }
>>>>>>
>>>>>> I will put these in the new version.
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Jarrod
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug  
>>>>>> 2014 21:52:30 +0200:
>>>>>>
>>>>>>> Hi Jarrod,
>>>>>>>
>>>>>>> thanks for these pointers.
>>>>>>>
>>>>>>>>> You will need to provide over-dispersed starting values for  
>>>>>>>>> multiple-chain convergence diagnostics to be useful (GLMM  
>>>>>>>>> are so simple I am generally happy if the output of a single  
>>>>>>>>> run looks reasonable).
>>>>>>>
>>>>>>> Oh, I would be happy with single chains, but since computation  
>>>>>>> would take weeks that way, I wanted to parallelise, and I would  
>>>>>>> use multi-chain convergence as a check that my parallelisation  
>>>>>>> was proper and as informative as a single long chain. There  
>>>>>>> don't seem to be any such checks built in – I was analysing my  
>>>>>>> 40 chains for a bit longer than I like to admit before I  
>>>>>>> noticed they were identical (effectiveSize and  
>>>>>>> summary.mcmc.list did not yell at me about this).
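>>>>>>>
>>>>>>> I now check this explicitly with something like (mcmclist being  
>>>>>>> the mcmc.list of chains):
>>>>>>>
>>>>>>> # sanity check: is any later chain identical to the first?
>>>>>>> any(sapply(2:length(mcmclist), function(i) identical(mcmclist[[1]], mcmclist[[i]])))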
>>>>>>>
>>>>>>>>> # use some very bad starting values
>>>>>>> I get that these values are bad, but that is the goal for my  
>>>>>>> multi-chain aim, right?
>>>>>>>
>>>>>>> I can apply this to my zero-truncated model, but am again  
>>>>>>> getting stuck with the zero-altered one.
>>>>>>> Maybe I need only specify the Liab values for this?
>>>>>>> At least I'm getting nowhere with specifying R and G starting  
>>>>>>> values here. When I got an error, I always
>>>>>>> went to the MCMCglmm source to understand why the checks  
>>>>>>> failed, but I didn't always understand
>>>>>>> what was being checked and couldn't get it to work.
>>>>>>>
>>>>>>> Here's a failing example:
>>>>>>>
>>>>>>> l<-rnorm(200, -1, sqrt(1))
>>>>>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>>>>>> g = sample(letters[1:10], size = 200, replace = T)
>>>>>>> pred = rnorm(200)
>>>>>>> y<-rpois(200,exp(l)-t)
>>>>>>> y[1:40] = 0
>>>>>>> # generate zero-altered data with an intercept of -1
>>>>>>>
>>>>>>> dat<-data.frame(y=y, g = g, pred = pred)
>>>>>>> set.seed(1)
>>>>>>> start_true = list(Liab=l, R = 1, G = 1 )
>>>>>>> m1<-MCMCglmm(y~1 + pred,random = ~ g,  
>>>>>>> family="zapoisson",rcov=~us(trait):units, data=dat, start=  
>>>>>>> start_true)
>>>>>>>
>>>>>>> # use true latent variable as starting values
>>>>>>> set.seed(1)
>>>>>>> # use some very bad starting values
>>>>>>> start_rand = list(Liab=rnorm(200), R = rpois(1, 1)+1, G =  
>>>>>>> rpois(1, 1)+1 )
>>>>>>> m2<-MCMCglmm(y~1 + pred,random = ~ g,rcov=~us(trait):units,   
>>>>>>> family="zapoisson", data=dat, start = start_rand)
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Ruben
>>>>>>>
>>>>>>> On 25 Aug 2014, at 18:29, Jarrod Hadfield <j.hadfield at ed.ac.uk> wrote:
>>>>>>>
>>>>>>>> Hi Ruben,
>>>>>>>>
>>>>>>>> Sorry  - I was wrong when I said that everything is Gibbs  
>>>>>>>> sampled conditional on the latent variables. The location  
>>>>>>>> effects (fixed and random effects) are also sampled  
>>>>>>>> conditional on the (co)variance components so you should add  
>>>>>>>> them to the starting values. In the case where the true  
>>>>>>>> values are used:
>>>>>>>>
>>>>>>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat,  
>>>>>>>> start=list(Liab=l,R=1))
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Jarrod
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Quoting Jarrod Hadfield <j.hadfield at ed.ac.uk> on Mon, 25 Aug  
>>>>>>>> 2014 17:14:14 +0100:
>>>>>>>>
>>>>>>>>> Hi Ruben,
>>>>>>>>>
>>>>>>>>> You will need to provide over-dispersed starting values for  
>>>>>>>>> multiple-chain convergence diagnostics to be useful (GLMM  
>>>>>>>>> are so simple I am generally happy if the output of a single  
>>>>>>>>> run looks reasonable).
>>>>>>>>>
>>>>>>>>> With non-Gaussian data everything is Gibbs sampled  
>>>>>>>>> conditional on the latent variables, so you only need to  
>>>>>>>>> pass them:
>>>>>>>>>
>>>>>>>>> l<-rnorm(200, -1, sqrt(1))
>>>>>>>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>>>>>>>> y<-rpois(200,exp(l)-t)+1
>>>>>>>>> # generate zero-truncated data with an intercept of -1
>>>>>>>>>
>>>>>>>>> dat<-data.frame(y=y)
>>>>>>>>> set.seed(1)
>>>>>>>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat, start=list(Liab=l))
>>>>>>>>> # use true latent variable as starting values
>>>>>>>>> set.seed(1)
>>>>>>>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat,  
>>>>>>>>> start=list(Liab=rnorm(200)))
>>>>>>>>> # use some very bad starting values
>>>>>>>>>
>>>>>>>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>>>>>>>> # not identical despite the same seed because of different  
>>>>>>>>> starting values but clearly sampling the same posterior  
>>>>>>>>> distribution:
>>>>>>>>>
>>>>>>>>> gelman.diag(mcmc.list(m1$Sol, m2$Sol))
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Jarrod
>>>>>>>>>
>>>>>>>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25 Aug  
>>>>>>>>> 2014 18:00:08 +0200:
>>>>>>>>>
>>>>>>>>>> Dear Jarrod,
>>>>>>>>>>
>>>>>>>>>> thanks for the quick reply. Please don't waste time looking  
>>>>>>>>>> into doMPI – I am happy that I get the expected result when  
>>>>>>>>>> I specify that reproducible seed, whatever the reason may be.
>>>>>>>>>> I'm pretty sure that is the deciding factor, because I  
>>>>>>>>>> tested it explicitly; I just have no idea how or why it  
>>>>>>>>>> interacts with the choice of family.
>>>>>>>>>>
>>>>>>>>>> That said, is setting up different RNG streams for my  
>>>>>>>>>> workers (now that it works) __sufficient__
>>>>>>>>>> so that I get independent chains and can use gelman.diag()  
>>>>>>>>>> for convergence diagnostics?
>>>>>>>>>> Or should I still tinker with the starting values myself?
>>>>>>>>>> I've never found a worked example of supplying starting  
>>>>>>>>>> values and am thus a bit lost.
>>>>>>>>>>
>>>>>>>>>> Sorry for sending further questions; I hope someone else  
>>>>>>>>>> takes pity while you're busy with lectures.
>>>>>>>>>>
>>>>>>>>>> Best wishes
>>>>>>>>>>
>>>>>>>>>> Ruben
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 25 Aug 2014, at 17:29, Jarrod Hadfield  
>>>>>>>>>> <j.hadfield at ed.ac.uk> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Ruben,
>>>>>>>>>>>
>>>>>>>>>>> I do not think the issue is with the starting values,  
>>>>>>>>>>> because even if the same starting values were used the  
>>>>>>>>>>> chains would still differ because of the randomness in the  
>>>>>>>>>>> Markov chain (if I interpret your `identical' test  
>>>>>>>>>>> correctly). This just involves a call to GetRNGstate() in  
>>>>>>>>>>> the C++ code (L871 of MCMCglmm.cc), so I think for some  
>>>>>>>>>>> reason doMPI/foreach is not doing what you expect. I am not  
>>>>>>>>>>> familiar with doMPI and am in the middle of writing  
>>>>>>>>>>> lectures, so I haven't got time to look into it carefully.  
>>>>>>>>>>> Outside of the context of doMPI I get the behaviour I  
>>>>>>>>>>> expect:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> l<-rnorm(200, -1, sqrt(1))
>>>>>>>>>>> t<-(-log(1-runif(200)*(1-exp(-exp(l)))))
>>>>>>>>>>> y<-rpois(200,exp(l)-t)+1
>>>>>>>>>>> # generate zero-truncated data with an intercept of -1
>>>>>>>>>>>
>>>>>>>>>>> dat<-data.frame(y=y)
>>>>>>>>>>> set.seed(1)
>>>>>>>>>>> m1<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>>>>>>>> set.seed(2)
>>>>>>>>>>> m2<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>>>>>>>> set.seed(2)
>>>>>>>>>>> m3<-MCMCglmm(y~1, family="ztpoisson", data=dat)
>>>>>>>>>>>
>>>>>>>>>>> plot(mcmc.list(m1$Sol, m2$Sol))
>>>>>>>>>>> # different, as expected
>>>>>>>>>>> plot(mcmc.list(m2$Sol, m3$Sol))
>>>>>>>>>>> # the same, as expected
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Quoting Ruben Arslan <rubenarslan at gmail.com> on Mon, 25  
>>>>>>>>>>> Aug 2014 16:58:06 +0200:
>>>>>>>>>>>
>>>>>>>>>>>> Dear list,
>>>>>>>>>>>>
>>>>>>>>>>>> sorry for bumping my old post; I hope to elicit a  
>>>>>>>>>>>> response with a more focused question:
>>>>>>>>>>>>
>>>>>>>>>>>> When does MCMCglmm automatically start from different  
>>>>>>>>>>>> values when using doMPI/foreach?
>>>>>>>>>>>>
>>>>>>>>>>>> I have done some tests with models of varying complexity.  
>>>>>>>>>>>> For example, the script in my last
>>>>>>>>>>>> post (using "zapoisson") yielded 40 identical chains:
>>>>>>>>>>>>> identical(mcmclist[1], mcmclist[30])
>>>>>>>>>>>> TRUE
>>>>>>>>>>>>
>>>>>>>>>>>> A simpler (?) model (using "ztpoisson" and no specified  
>>>>>>>>>>>> prior), however, yielded different chains
>>>>>>>>>>>> and I could use them to calculate gelman.diag()
>>>>>>>>>>>>
>>>>>>>>>>>> Changing my script to the version below, i.e. seeding  
>>>>>>>>>>>> foreach using .options.mpi=list( seed= 1337)
>>>>>>>>>>>> so as to make RNGstreams reproducible (or so I  thought),  
>>>>>>>>>>>> led to different chains even for the
>>>>>>>>>>>> "zapoisson" model.
>>>>>>>>>>>>
>>>>>>>>>>>> In no case have I (successfully) tried to override the  
>>>>>>>>>>>> default of MCMCglmm's "start" argument.
>>>>>>>>>>>> Is starting my models from different RNG substreams  
>>>>>>>>>>>> inadequate compared to manipulating the start argument  
>>>>>>>>>>>> explicitly? If so, is there any worked example of explicit  
>>>>>>>>>>>> starting value manipulation in parallel computation?  
>>>>>>>>>>>> (What I have in mind is sketched below.)
>>>>>>>>>>>> I've browsed the MCMCglmm source to understand how the  
>>>>>>>>>>>> default starting values are generated,
>>>>>>>>>>>> but didn't find any differences with respect to RNG for  
>>>>>>>>>>>> the two families "ztpoisson" and "zapoisson"
>>>>>>>>>>>> (granted, I did not dig very deep).
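>>>>>>>>>>>>
>>>>>>>>>>>> The kind of thing I have in mind, inside the foreach body  
>>>>>>>>>>>> of the script below (an untested sketch, not something I  
>>>>>>>>>>>> have run):
>>>>>>>>>>>>
>>>>>>>>>>>> # untested sketch: per-worker random (over-dispersed) starting values
>>>>>>>>>>>> start.i <- list(
>>>>>>>>>>>> 	liab = rnorm(nrow(rpqa.1) * 2),            # zapoisson: 2 latents per datum
>>>>>>>>>>>> 	R = list(R1 = diag(2) * runif(1, 0.5, 2)),
>>>>>>>>>>>> 	G = list(G1 = diag(2) * runif(1, 0.5, 2))
>>>>>>>>>>>> )
>>>>>>>>>>>> # ...then pass start = start.i to the MCMCglmm() call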
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Ruben Arslan
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> # bsub -q mpi -W 12:00 -n 41 -R np20 mpirun -H localhost  
>>>>>>>>>>>> -n 41 R --slave -f  
>>>>>>>>>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_children_parallel.R"
>>>>>>>>>>>>
>>>>>>>>>>>> library(doMPI)
>>>>>>>>>>>> cl <-  
>>>>>>>>>>>> startMPIcluster(verbose=T,workdir="/usr/users/rarslan/rpqa/rpqa_main/")
>>>>>>>>>>>> registerDoMPI(cl)
>>>>>>>>>>>> Children_mcmc1 = foreach(i=1:clusterSize(cl),.options.mpi  
>>>>>>>>>>>> = list(seed=1337) ) %dopar% {
>>>>>>>>>>>> 	library(MCMCglmm);library(data.table)
>>>>>>>>>>>> 	load("/usr/users/rarslan/rpqa/rpqa1.rdata")
>>>>>>>>>>>>
>>>>>>>>>>>> 	nitt = 130000; thin = 100; burnin = 30000
>>>>>>>>>>>> 	prior.m5d.2 = list(
>>>>>>>>>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>>>>>>>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>>>>>>>>> 	)
>>>>>>>>>>>>
>>>>>>>>>>>> 	rpqa.1 = na.omit(rpqa.1[spouses>0, list(idParents,  
>>>>>>>>>>>> children, male, urban, spouses, paternalage.mean,  
>>>>>>>>>>>> paternalage.factor)])
>>>>>>>>>>>> 	(m1 = MCMCglmm( children ~ trait * (male + urban +  
>>>>>>>>>>>> spouses + paternalage.mean + paternalage.factor),
>>>>>>>>>>>> 						rcov=~us(trait):units,
>>>>>>>>>>>> 						random=~us(trait):idParents,
>>>>>>>>>>>> 						family="zapoisson",
>>>>>>>>>>>> 						prior = prior.m5d.2,
>>>>>>>>>>>> 						data=rpqa.1,
>>>>>>>>>>>> 						pr = F, saveX = F, saveZ = F,
>>>>>>>>>>>> 						nitt=nitt,thin=thin,burnin=burnin))
>>>>>>>>>>>> }
>>>>>>>>>>>>
>>>>>>>>>>>> library(coda)
>>>>>>>>>>>> mcmclist =  
>>>>>>>>>>>> mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>>>>>>>>> save(Children_mcmc1,mcmclist, file =  
>>>>>>>>>>>> "/usr/users/rarslan/rpqa/rpqa_main/rpqa_mcmc_kids_za.rdata")
>>>>>>>>>>>> closeCluster(cl)
>>>>>>>>>>>> mpi.quit()
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 04 Aug 2014, at 20:25, Ruben Arslan  
>>>>>>>>>>>> <rubenarslan at gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Dear list,
>>>>>>>>>>>>>
>>>>>>>>>>>>> would someone be willing to share her or his efforts in  
>>>>>>>>>>>>> parallelising an MCMCglmm analysis?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I had something viable using harvestr that seemed to  
>>>>>>>>>>>>> properly initialise
>>>>>>>>>>>>> the starting values from different random number streams  
>>>>>>>>>>>>> (which is desirable,
>>>>>>>>>>>>> as far as I could find out), but I ended up being unable  
>>>>>>>>>>>>> to use harvestr, because
>>>>>>>>>>>>> it uses an old version of plyr, where parallelisation  
>>>>>>>>>>>>> works only for multicore, not for
>>>>>>>>>>>>> MPI.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I pasted my working version, which does not do anything  
>>>>>>>>>>>>> about starting values or RNG, at the end of this email. I  
>>>>>>>>>>>>> can keep fumbling in the dark or try to update harvestr,  
>>>>>>>>>>>>> but maybe someone has gone through all this already.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd also appreciate any tips for elegantly  
>>>>>>>>>>>>> post-processing such parallel data, as some of my usual  
>>>>>>>>>>>>> extraction functions and routines are hampered by the  
>>>>>>>>>>>>> fact that some coda functions do not aggregate results  
>>>>>>>>>>>>> over chains. (What I get from a single-chain summary in  
>>>>>>>>>>>>> MCMCglmm is a bit more comprehensive than what I managed  
>>>>>>>>>>>>> to cobble together with my own extraction functions.)
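>>>>>>>>>>>>>
>>>>>>>>>>>>> What I cobbled together is roughly along these lines (a  
>>>>>>>>>>>>> sketch; mcmclist as in the script below):
>>>>>>>>>>>>>
>>>>>>>>>>>>> # pool all chains into one mcmc object for functions that
>>>>>>>>>>>>> # don't handle mcmc.list
>>>>>>>>>>>>> pooled <- as.mcmc(do.call(rbind, lapply(mcmclist, as.matrix)))
>>>>>>>>>>>>> summary(pooled)
>>>>>>>>>>>>> effectiveSize(mcmclist)  # this one does sum over chains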
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reason I'm parallelising my analyses is that I'm  
>>>>>>>>>>>>> having trouble getting a good effective
>>>>>>>>>>>>> sample size for any parameter having to do with the many  
>>>>>>>>>>>>> zeroes in my data.
>>>>>>>>>>>>> Any pointers are very appreciated; I'm quite  
>>>>>>>>>>>>> inexperienced with MCMCglmm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best wishes
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ruben
>>>>>>>>>>>>>
>>>>>>>>>>>>> # bsub -q mpi-short -W 2:00 -n 42 -R np20 mpirun -H  
>>>>>>>>>>>>> localhost -n 41 R --slave -f  
>>>>>>>>>>>>> "rpqa/rpqa_main/rpqa_children_parallel.r"
>>>>>>>>>>>>> library(doMPI)
>>>>>>>>>>>>> cl <- startMPIcluster()
>>>>>>>>>>>>> registerDoMPI(cl)
>>>>>>>>>>>>> Children_mcmc1 = foreach(i=1:40) %dopar% {
>>>>>>>>>>>>> 	library(MCMCglmm)
>>>>>>>>>>>>> 	load("rpqa1.rdata")
>>>>>>>>>>>>>
>>>>>>>>>>>>> 	nitt = 40000; thin = 100; burnin = 10000
>>>>>>>>>>>>> 	prior = list(
>>>>>>>>>>>>> 		R = list(V = diag(c(1,1)), nu = 0.002),
>>>>>>>>>>>>> 		G=list(list(V=diag(c(1,1e-6)),nu=0.002))
>>>>>>>>>>>>> 	)
>>>>>>>>>>>>>
>>>>>>>>>>>>> 	MCMCglmm( children ~ trait -1 + at.level(trait,1):male  
>>>>>>>>>>>>> + at.level(trait,1):urban + at.level(trait,1):spouses +  
>>>>>>>>>>>>> at.level(trait,1):paternalage.mean +  
>>>>>>>>>>>>> at.level(trait,1):paternalage.factor,
>>>>>>>>>>>>> 		rcov=~us(trait):units,
>>>>>>>>>>>>> 		random=~us(trait):idParents,
>>>>>>>>>>>>> 		family="zapoisson",
>>>>>>>>>>>>> 		prior = prior,
>>>>>>>>>>>>> 		data=rpqa.1,
>>>>>>>>>>>>> 		pr = F, saveX = T, saveZ = T,
>>>>>>>>>>>>> 		nitt=nitt,thin=thin,burnin=burnin)
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>> library(coda)
>>>>>>>>>>>>> mcmclist =  
>>>>>>>>>>>>> mcmc.list(lapply(Children_mcmc1,FUN=function(x) { x$Sol}))
>>>>>>>>>>>>> save(Children_mcmc1,mcmclist, file = "rpqa_mcmc_kids_za.rdata")
>>>>>>>>>>>>> closeCluster(cl)
>>>>>>>>>>>>> mpi.quit()
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Ruben C. Arslan
>>>>>>>>>>>>>
>>>>>>>>>>>>> Georg August University Göttingen
>>>>>>>>>>>>> Biological Personality Psychology and Psychological Assessment
>>>>>>>>>>>>> Georg Elias Müller Institute of Psychology
>>>>>>>>>>>>> Goßlerstr. 14
>>>>>>>>>>>>> 37073 Göttingen
>>>>>>>>>>>>> Germany
>>>>>>>>>>>>> Tel.: +49 551 3920704
>>>>>>>>>>>>> https://psych.uni-goettingen.de/en/biopers/team/arslan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


