[R-sig-ME] afex package & bootstrapping
Henrik Singmann
singmann at psychologie.uzh.ch
Wed Jul 6 14:04:40 CEST 2016
Hi Dan,
Your code looks reasonable to me. details = 2 is probably responsible
for the following message, which shows you the time it took to compute
the reference distribution for the first p-value:
"Reference distribution with 996 samples; computing time: 159891.87 secs."
The single period in the output also shows that the first of 15
p-values has been successfully calculated and the next is currently
being calculated. So it seems that computing the reference distribution
for the first p-value alone took around 44 hours on your 12 cores.
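The reported computing time converts directly:

159891.87 / 3600  # seconds to hours
## [1] 44.41441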
Unfortunately, the speed-up from parallel computation is not linear
when calculating the reference distribution via pbkrtest::PBmodcomp
(which mixed uses for this computation). This is most likely because
some model fits are faster than others, so cores that finish early sit
idle until every core is done and the next batch of models can start (I
didn't look at the code of PBmodcomp, but that seems somewhat likely).
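To illustrate the scheduling issue (this is only a sketch, not
PBmodcomp's actual code; simulate_one is a made-up stand-in for a
single bootstrap refit):

require("parallel")
cl <- makeCluster(4)
clusterSetRNGStream(cl, 1)
## made-up stand-in for one bootstrap simulation: runtimes vary because
## some refits converge more slowly than others
simulate_one <- function(i) { Sys.sleep(runif(1, 0.1, 0.5)); i }
## parLapply pre-splits the jobs into one chunk per node, so nodes with
## fast chunks sit idle while the slowest chunk finishes
system.time(parLapply(cl, 1:16, simulate_one))
## clusterApplyLB hands out jobs one at a time (load balancing), which
## keeps all nodes busy until the end
system.time(clusterApplyLB(cl, 1:16, simulate_one))
stopCluster(cl)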
On my 8 cores one can nevertheless see a considerable speed-up with
multicore compared to without, using an example model (16 samples is
obviously far too few for real use, but enough to compare timings):
require("afex")
require("mlmRev")
require("parallel")
(nc <- detectCores())  # 8 in my case
cl <- makeCluster(rep("localhost", nc))
mixed(use ~ age + I(age^2) + urban + livch + (1 | district),
      family = binomial, method = "PB", data = Contraception,
      args.test = list(nsim = 16, cl = cl, details = 2))
## Contrasts set to contr.sum for the following variables: use, urban,
##   livch, district
## Numerical variables NOT centered on 0 (i.e., interpretation of all
##   main effects might be difficult if in interactions): age
## Fitting 5 (g)lmer() models:
## [.....]
## Obtaining 4 p-values:
## [Reference distribution with 16 samples; computing time: 33.29 secs.
## .Reference distribution with 16 samples; computing time: 37.25 secs.
## .Reference distribution with 16 samples; computing time: 41.06 secs.
## .Reference distribution with 16 samples; computing time: 27.92 secs.
## .]
mixed(use ~ age + I(age^2) + urban + livch + (1 | district),
      family = binomial, method = "PB", data = Contraception,
      args.test = list(nsim = 16, details = 2))
## Contrasts set to contr.sum for the following variables: use, urban,
##   livch, district
## Numerical variables NOT centered on 0 (i.e., interpretation of all
##   main effects might be difficult if in interactions): age
## Fitting 5 (g)lmer() models:
## [.....]
## Obtaining 4 p-values:
## [Reference distribution with 16 samples; computing time: 143.49 secs.
## .Reference distribution with 16 samples; computing time: 201.51 secs.
## .Reference distribution with 16 samples; computing time: 185.76 secs.
## .Reference distribution with 16 samples; computing time: 120.80 secs.
## .]
You can see a speed-up from around 150 to around 35 seconds per
p-value, which corresponds to roughly a fourfold increase in speed
(150/35 ≈ 4.3), but nowhere near an eightfold one. So there is not much
you can do other than wait if you want the parametric bootstrap rather
than, e.g., method = "LRT" (or use a machine with more cores, but given
the idling problem described above, the additional speed-up will most
likely keep diminishing).
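For comparison, here is a sketch of the likelihood-ratio-test
alternative with the same example model; it simulates no reference
distribution and only needs the nested model fits, so it runs in
seconds:

mixed(use ~ age + I(age^2) + urban + livch + (1 | district),
      family = binomial, method = "LRT", data = Contraception)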
Hope that helps,
Henrik (author of afex)
On 05.07.2016 at 22:59, Daniel McCloy wrote:
>
> I'm using afex to get bootstrapped p-values. The model fit is quite
> fast, but the bootstrapped p-values are taking forever and I wonder if
> I'm doing something stupid. Relevant details from the model summary:
>
> Generalized linear mixed model fit by maximum likelihood (Laplace
> Approximation) ['glmerMod']
> Family: binomial ( probit )
> Formula: press ~ truth * revb * gend * attn + (1 | subj)
> Number of obs: 20480, groups: subj, 16
>
> I set the script to use all 12 cores of my local machine and run over
> the weekend:
>
> library(afex)
> library(parallel)
> cl <- makeForkCluster(nnodes=12, outfile="cluster-log.txt")
> form <- formula(press ~ truth*revb*gend*attn + (1|subj))
> rev_mod <- mixed(form, data=revb_df, family=binomial(link="probit"),
> method="PB", check.contrasts=FALSE, cl=cl,
> args.test=list(nsim=1000, cl=cl, seed=1234, details=2),
> control=glmerControl(optCtrl=list(maxfun=30000)))
>
> After running for 3.5 days, all 12 cores are still maxed out, RAM usage
> looks fine (around 16GB in use of the 32GB total), and the afex output
> looks like this:
>
> Fitting 16 (g)lmer() models.
> Obtaining 15 p-values:
> [Reference distribution with 996 samples; computing time: 159891.87 secs.
> .
>
> (the single period on the 4th line is part of the output). The first 2
> lines appeared before I left work on Friday (i.e., within about 15
> minutes of starting the script). So presumably it has been working on
> the p-values for the remaining 80+ hours. I don't know exactly when the
> "computing time" line appeared, but its estimate works out to about 44
> hours... is that just for the first of the 15 p-values, or for all of
> them? Is creating a cluster with all 12 cores a foolish idea? (Am I
> thereby strangling the multithreading abilities of some underlying
> linalg library, effectively slowing things down?) The dataset and
> number of iterations just don't seem all that huge to me but maybe I'm
> being naive.
>
> - -- dan
>
> Daniel McCloy
> http://dan.mccloy.info
> Postdoctoral Research Associate
> Institute for Learning & Brain Sciences
> University of Washington