[R-sig-ME] Failing to understand effect of centering a predictor in multilevel model

Thierry Onkelinx thierry.onkelinx at inbo.be
Thu Sep 21 11:36:40 CEST 2017


Dear Chris,

This is because you use poly() on all variables

x <- 10:20
cx <- x - mean(x)
all.equal(poly(x, 2), poly(cx, 2), check.attributes = FALSE)
all.equal(poly(x, 2), poly(cx, 2))

Best regards,

ir. Thierry Onkelinx
Statisticus/ Statiscian

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx op inbo.be
Kliniekstraat 25, B-1070 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be
able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////


Van 14 tot en met 19 december 2017 verhuizen we uit onze vestiging in
Brussel naar het Herman Teirlinckgebouw op de site Thurn & Taxis.
Vanaf dan ben je welkom op het nieuwe adres: Havenlaan 88 bus 73, 1000 Brussel.

///////////////////////////////////////////////////////////////////////////////////////////



2017-09-21 11:29 GMT+02:00 Chris Evans <chrishold op psyctc.org>:
> I am trying to make sure I understand the use of multilevel modelling to look at change (on questionnaire scores) across sessions of psychotherapy. This is widely done in my applied field. I am a researcher (and ex-therapist) not a statistician and have no direct availability of a statistician but I'm fairly savvy about stats. I am simulating, rather crudely, data which have both fixed and random intercepts and slopes with a quadratic as well as a linear effect of session number. I'm using lmer from lme4 overloaded by lmerTest in 64 bit R 3.4.1 on Windoze running in Rstudio.
>
> The thing that's baffling me is that centering my predictor (session count) seems to have NO effect on my results, neither on the fixed nor the random effects. I would have expected the intercept to have changed at the very least given the effects I've simulated.
>
> Details: session count can range from 1 to 51 (the number of sessions per client in my simulation can vary, realistically, from 4 to 51 so no client has fewer than four). I have modelled in a random linear and quadratic effect of session count but such that the overall effects are both negative. Code and session info are at the end of the post. Here is code and output from uncentred:
>
> m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID),
> data=longdat4)
> summary(m4)
>
> ## Linear mixed model fit by REML t-tests use Satterthwaite approximations
> ## to degrees of freedom [lmerMod]
> ## Formula:
> ## DEP_TOTAL ~ poly(session, 2) + (1 + poly(session, 2) | validPOC_ID)
> ## Data: longdat4
> ##
> ## REML criterion at convergence: -29507
> ##
> ## Scaled residuals:
> ## Min 1Q Median 3Q Max
> ## -3.6274 -0.6583 0.0035 0.6420 4.0849
> ##
> ## Random effects:
> ## Groups Name Variance Std.Dev. Corr
> ## validPOC_ID (Intercept) 9.309e+00 3.05101
> ## poly(session, 2)1 7.390e+03 85.96686 0.20
> ## poly(session, 2)2 4.154e+02 20.38232 0.21 1.00
> ## Residual 4.973e-03 0.07052
> ## Number of obs: 15399, groups: validPOC_ID, 500
> ##
> ## Fixed effects:
> ## Estimate Std. Error df t value Pr(>|t|)
> ## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 ***
> ## poly(session, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 ***
> ## poly(session, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 ***
> ## ---
> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ##
> ## Correlation of Fixed Effects:
> ## (Intr) p(,2)1
> ## ply(sss,2)1 0.210
> ## ply(sss,2)2 0.215 0.995
>
> Here is the centered one (lengthCentre is 29.5).
>
> longdat4 $ sessionC1 <- longdat4 $ session - lengthCentre summary ( longdat4 $ sessionC1 )
> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
> ## -28.50 -21.50 -12.50 -10.09 -0.50 21.50
> m5 <- lmer ( DEP_TOTAL ~ poly ( sessionC1 , 2 ) + ( 1 + poly ( sessionC1 , 2 ) | validPOC_ID ) , data = longdat4 ) summary ( m5 )
> ## Linear mixed model fit by REML t-tests use Satterthwaite approximations
> ## to degrees of freedom [lmerMod]
> ## Formula:
> ## DEP_TOTAL ~ poly(sessionC1, 2) + (1 + poly(sessionC1, 2) | validPOC_ID)
> ## Data: longdat4
> ##
> ## REML criterion at convergence: -29507
> ##
> ## Scaled residuals:
> ## Min 1Q Median 3Q Max
> ## -3.6274 -0.6583 0.0035 0.6420 4.0849
> ##
> ## Random effects:
> ## Groups Name Variance Std.Dev. Corr
> ## validPOC_ID (Intercept) 9.309e+00 3.05101
> ## poly(sessionC1, 2)1 7.390e+03 85.96686 0.20
> ## poly(sessionC1, 2)2 4.154e+02 20.38232 0.21 1.00
> ## Residual 4.973e-03 0.07052
> ## Number of obs: 15399, groups: validPOC_ID, 500
> ##
> ## Fixed effects:
> ## Estimate Std. Error df t value Pr(>|t|)
> ## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 ***
> ## poly(sessionC1, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 ***
> ## poly(sessionC1, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 ***
> ## ---
> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ##
> ## Correlation of Fixed Effects:
> ## (Intr) p(C1,2)1
> ## ply(sC1,2)1 0.210
> ## ply(sC1,2)2 0.215 0.995
>
> This makes no sense to me. I had expected that the very crude centering must shift the fixed intercept as the x=0 has shifted left by 29.5 and I have strong linear and quadratic effects in the model. I also thought (the whole reason I'm doing this) that it would change the correlation between the linear and quadratic terms as I should be moving from a strong positive correlation between them to a much smaller one.
>
> I'm sure I'm doing something utterly stupid, or revealing shocking failure to comprehend the multilevel model, but I'm looking straight through the explanation and stuck. I hope someone will help me!
>
> TIA,
>
>
> Chris
>
>
>
> Here is entire pertinent code.
>
> ```{r sim4,warning=FALSE,message=FALSE,cache=TRUE}
> require(reshape2)
> set.seed(12345)
> ### set up the parameters for the simulation
> n <- 500
> minNSessions <- 4
> maxNSessions <- 55
> linSlope <- -.02
> errVar <- .005
> ### derived variables
> lengthRange <- minNSessions:maxNSessions
> lengthCentre <- (minNSessions+maxNSessions)/2
> ID <- 1:n
> ### create a variable, length = n for the session counts
> sessions <- sample(lengthRange,n,replace=TRUE)
> ### for now, give all starting scores of 10
> startScores <- rnorm(n,mean=10,sd=3)
> slopes <- rnorm(n,mean=linSlope,sd=.01) # random slopes
> ### set up matrix to store the simulated scores
> scores <- matrix(rep(NA,n*maxNSessions),nrow=n)
> dat4 <- as.data.frame(cbind(ID,sessions,startScores,slopes,scores))
> for (i in 1:nrow(dat4)) {
> nSessions <- dat4$sessions[i]
> tmpScores <- rep(dat4$startScores[i],nSessions) # scores: no slopes
> tmpScores <- tmpScores + c(0,slopes[i]*(2:nSessions)) # scores: add linear effect
> tmpScores <- tmpScores + c(0,slopes[i]*((2:nSessions)^2))/10 # add quadratic effect
> tmpScores <- tmpScores + rnorm(nSessions,mean=0,sd=sqrt(errVar)) # now add error
> dat4[i,5:(4+nSessions)] <- tmpScores # put into data frame
> }
> ### now melt to create a long data frame
> longdat4 <- melt(dat4,id.vars="ID",measure.vars = 4:55)
> colnames(longdat4) <- c("validPOC_ID","wks","DEP_TOTAL")
> longdat4$session <- as.numeric(substr(longdat4$wks,2,3))-4 # minus 4 to get start at 1
> # as data start in column 5
> longdat4 <- longdat4[order(longdat4$validPOC_ID,longdat4$wks),]
> longdat4 <- na.omit(longdat4) # ditch the empty data
> ```
>
> So the data frame longdat4 has
> `r nrow(longdat4)` rows and
> `r ncol(longdat4)` columns and the column names are
> `r colnames(longdat4)`. Here is a bit of the head of the wide data frame (dat4) and the head of the long one (longdat4).
>
> ```{r head4,warning=FALSE,message=FALSE,cache=TRUE}
> require(knitr)
> kable(head(dat4[,1:10]),digits=3)
> kable(head(longdat4))
> ```
>
> Elisa's lattice plot.
>
> ```{r spaghetti4,warning=FALSE,message=FALSE,cache=TRUE}
> library(lattice)
> samp <- sample(unique(longdat4$validPOC_ID), 35)
> xyplot(DEP_TOTAL ~ wks|validPOC_ID, longdat4,
> type = c("p","g","r"),
> subset = (validPOC_ID %in% samp),
> col="dark blue",
> col.line="black",
> strip = F,
> xlab = "Weeks in treatment",
> ylab = "Depression Scores")
> ```
>
> Now Elisa's tangle plot. Just first 40 simulated cases.
>
> ```{r tangle4, warning=FALSE,message=FALSE,cache=TRUE}
> library(ggplot2)
> first40 <- longdat4[longdat4$validPOC_ID <=40,]
> tanglePlot <- ggplot(data = longdat4, aes(x = wks, y = DEP_TOTAL)) +
> geom_line(data = longdat4, alpha = 0.5, aes(group = validPOC_ID)) +
> theme_bw()
> tanglePlot
> ```
>
> ```{r model4, warning=FALSE, message=FALSE,cache=TRUE}
> require(lmerTest)
> ### this next fails to converge and has perfect correlation of linear & quadratic effects
> # m4 <- lmer(DEP_TOTAL ~ session+I(session^2)+(1+session+I(session^2)|validPOC_ID),
> # data=longdat4)
> ### is this better:
> summary(longdat4$session)
> hist(longdat4$session,
> main="Histogram of raw session counts",
> xlab="Raw session n",
> col="grey")
> m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID),
> data=longdat4)
> summary(m4)
> ```
>
> This next should test the random effect (and should find it highly significant I think).
>
> ```{r mode4rand,eval=FALSE}
> rand(m4)
> ### fails with message:
> ### Error in terms.formula(tmp, simplify = TRUE) :
> ### invalid model formula in ExtractVars
> ```
>
> And this summarises the random effects. These are deviations from the fixed effects so both are centred on zero.
>
> ```{r coeffs4,fig.width=11,fig.height=8,cache=TRUE}
> apply(ranef(m4)$validPOC_ID,2,summary)
> par(mfrow=c(1,3))
> hist(ranef(m4)$validPOC_ID[,1],
> xlab="Intercept deviation values",
> main="Random intercepts",
> col="grey")
> abline(v=0)
> hist(ranef(m4)$validPOC_ID[,2],
> main="Random slopes",
> xlab="Linear slope deviation values",
> col="grey")
> abline(v=0)
> hist(ranef(m4)$validPOC_ID[,3],
> main="Random slopes",
> xlab="Quadratic slope deviation values",
> col="grey")
> abline(v=0)
> ```
>
>
> # Simulation 5: random intercept and random slope with quadratic effect, CENTRED
>
> OK. Random intercept and slope has linear and quadratic effects. Now going to centre the session count by subtracting lengthCentre
>
>
> ```{r model5, warning=FALSE, message=FALSE}
> longdat4$sessionC1 <- longdat4$session - lengthCentre
> summary(longdat4$sessionC1)
> hist(longdat4$sessionC1,
> main="Histogram of centred session counts",
> xlab="Centred session n",
> col="grey")
> abline(v=0)
> require(lmerTest)
> m5 <- lmer(DEP_TOTAL ~ poly(sessionC1,2)+(1+poly(sessionC1,2)|validPOC_ID),
> data=longdat4)
> summary(m5)
> ```
>
>
>
> ```{r mode5rand,eval=FALSE}
> rand(m5)
> ### fails with message:
> ### Error in terms.formula(tmp, simplify = TRUE) :
> ### invalid model formula in ExtractVars
> ```
>
>
> ```{r coeffs5,fig.width=11,fig.height=8}
> apply(ranef(m5)$validPOC_ID,2,summary)
> par(mfrow=c(1,3))
> hist(ranef(m5)$validPOC_ID[,1],
> xlab="Intercept deviation values",
> main="Random intercepts",
> col="grey")
> abline(v=0)
> hist(ranef(m5)$validPOC_ID[,2],
> main="Random slopes",
> xlab="Linear slope deviation values",
> col="grey")
> abline(v=0)
> hist(ranef(m5)$validPOC_ID[,3],
> main="Random slopes",
> xlab="Quadratic slope deviation values",
> col="grey")
> abline(v=0)
> ```
>
> And here is sessionInfo()
>
>> sessionInfo()
> R version 3.4.1 (2017-06-30)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows >= 8 x64 (build 9200)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ggplot2_2.2.1 lattice_0.20-35 knitr_1.17 lmerTest_2.0-33 lme4_1.1-13
> [6] Matrix_1.2-11 reshape2_1.4.2
>
> loaded via a namespace (and not attached):
> [1] Rcpp_0.12.12 highr_0.6 RColorBrewer_1.1-2 compiler_3.4.1
> [5] nloptr_1.0.4 plyr_1.8.4 base64enc_0.1-3 tools_3.4.1
> [9] rpart_4.1-11 digest_0.6.12 checkmate_1.8.3 htmlTable_1.9
> [13] evaluate_0.10.1 tibble_1.3.4 nlme_3.1-131 gtable_0.2.0
> [17] rlang_0.1.2 yaml_2.1.14 gridExtra_2.3 cluster_2.0.6
> [21] stringr_1.2.0 htmlwidgets_0.9 nnet_7.3-12 rprojroot_1.2
> [25] grid_3.4.1 data.table_1.10.4 survival_2.41-3 foreign_0.8-69
> [29] rmarkdown_1.6 latticeExtra_0.6-28 minqa_1.2.4 Formula_1.2-2
> [33] magrittr_1.5 backports_1.1.0 Hmisc_4.0-3 scales_0.5.0
> [37] htmltools_0.3.6 MASS_7.3-47 splines_3.4.1 rsconnect_0.8.5
> [41] colorspace_1.3-2 labeling_0.3 stringi_1.1.5 acepack_1.4.1
> [45] lazyeval_0.2.0 munsell_0.4.3
>>
>
>
>         [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-mixed-models op r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models



More information about the R-sig-mixed-models mailing list