[R-sig-ME] Failing to understand effect of centering a predictor in multilevel model

Diana Michl dmichl at uni-potsdam.de
Tue Sep 26 13:30:41 CEST 2017


This may be a superficial answer given the detail of your question, but 
as far as I understand, centering your predictor is a linear 
transformation. You're simply subtracting values so the predictor is 
centered around a different value than before. Thus, it should *not* 
make a difference to your results.
Centering and scaling predictor is done to make results more easily 
interpretable and to deal with different scales your predictors are on - 
not change your results.

Cheers
Diana


Am 21.09.2017 um 09:48 schrieb Chris Evans:
> I am trying to make sure I understand the use of multilevel modelling to look at change (on questionnaire scores) across sessions of psychotherapy. This is widely done in my applied field. I am a researcher (and ex-therapist) not a statistician and have no direct availability of a statistician but I'm fairly savvy about stats. I am simulating, rather crudely, data which have both fixed and random intercepts and slopes with a quadratic as well as a linear effect of session number. I'm using lmer from lme4 overloaded by lmerTest in 64 bit R 3.4.1 on Windoze running in Rstudio.
>
> The thing that's baffling me is that centering my predictor (session count) seems to have NO effect on my results, neither on the fixed nor the random effects. I would have expected the intercept to have changed at the very least given the effects I've simulated.
>
> Details: session count can range from 1 to 51 (the number of sessions per client in my simulation can vary, realistically, from 4 to 51 so no client has fewer than four). I have modelled in a random linear and quadratic effect of session count but such that the overall effects are both negative. Code and session info are at the end of the post. Here is code and output from uncentred:
>
> m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID),
> data=longdat4)
> summary(m4)
>
> ## Linear mixed model fit by REML t-tests use Satterthwaite approximations
> ## to degrees of freedom [lmerMod]
> ## Formula:
> ## DEP_TOTAL ~ poly(session, 2) + (1 + poly(session, 2) | validPOC_ID)
> ## Data: longdat4
> ##
> ## REML criterion at convergence: -29507
> ##
> ## Scaled residuals:
> ## Min 1Q Median 3Q Max
> ## -3.6274 -0.6583 0.0035 0.6420 4.0849
> ##
> ## Random effects:
> ## Groups Name Variance Std.Dev. Corr
> ## validPOC_ID (Intercept) 9.309e+00 3.05101
> ## poly(session, 2)1 7.390e+03 85.96686 0.20
> ## poly(session, 2)2 4.154e+02 20.38232 0.21 1.00
> ## Residual 4.973e-03 0.07052
> ## Number of obs: 15399, groups: validPOC_ID, 500
> ##
> ## Fixed effects:
> ## Estimate Std. Error df t value Pr(>|t|)
> ## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 ***
> ## poly(session, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 ***
> ## poly(session, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 ***
> ## ---
> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ##
> ## Correlation of Fixed Effects:
> ## (Intr) p(,2)1
> ## ply(sss,2)1 0.210
> ## ply(sss,2)2 0.215 0.995
>
> Here is the centered one (lengthCentre is 29.5).
>
> longdat4 $ sessionC1 <- longdat4 $ session - lengthCentre summary ( longdat4 $ sessionC1 )
> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
> ## -28.50 -21.50 -12.50 -10.09 -0.50 21.50
> m5 <- lmer ( DEP_TOTAL ~ poly ( sessionC1 , 2 ) + ( 1 + poly ( sessionC1 , 2 ) | validPOC_ID ) , data = longdat4 ) summary ( m5 )
> ## Linear mixed model fit by REML t-tests use Satterthwaite approximations
> ## to degrees of freedom [lmerMod]
> ## Formula:
> ## DEP_TOTAL ~ poly(sessionC1, 2) + (1 + poly(sessionC1, 2) | validPOC_ID)
> ## Data: longdat4
> ##
> ## REML criterion at convergence: -29507
> ##
> ## Scaled residuals:
> ## Min 1Q Median 3Q Max
> ## -3.6274 -0.6583 0.0035 0.6420 4.0849
> ##
> ## Random effects:
> ## Groups Name Variance Std.Dev. Corr
> ## validPOC_ID (Intercept) 9.309e+00 3.05101
> ## poly(sessionC1, 2)1 7.390e+03 85.96686 0.20
> ## poly(sessionC1, 2)2 4.154e+02 20.38232 0.21 1.00
> ## Residual 4.973e-03 0.07052
> ## Number of obs: 15399, groups: validPOC_ID, 500
> ##
> ## Fixed effects:
> ## Estimate Std. Error df t value Pr(>|t|)
> ## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 ***
> ## poly(sessionC1, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 ***
> ## poly(sessionC1, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 ***
> ## ---
> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> ##
> ## Correlation of Fixed Effects:
> ## (Intr) p(C1,2)1
> ## ply(sC1,2)1 0.210
> ## ply(sC1,2)2 0.215 0.995
>
> This makes no sense to me. I had expected that the very crude centering must shift the fixed intercept as the x=0 has shifted left by 29.5 and I have strong linear and quadratic effects in the model. I also thought (the whole reason I'm doing this) that it would change the correlation between the linear and quadratic terms as I should be moving from a strong positive correlation between them to a much smaller one.
>
> I'm sure I'm doing something utterly stupid, or revealing shocking failure to comprehend the multilevel model, but I'm looking straight through the explanation and stuck. I hope someone will help me!
>
> TIA,
>
>
> Chris
>
>
>
> Here is entire pertinent code.
>
> ```{r sim4,warning=FALSE,message=FALSE,cache=TRUE}
> require(reshape2)
> set.seed(12345)
> ### set up the parameters for the simulation
> n <- 500
> minNSessions <- 4
> maxNSessions <- 55
> linSlope <- -.02
> errVar <- .005
> ### derived variables
> lengthRange <- minNSessions:maxNSessions
> lengthCentre <- (minNSessions+maxNSessions)/2
> ID <- 1:n
> ### create a variable, length = n for the session counts
> sessions <- sample(lengthRange,n,replace=TRUE)
> ### for now, give all starting scores of 10
> startScores <- rnorm(n,mean=10,sd=3)
> slopes <- rnorm(n,mean=linSlope,sd=.01) # random slopes
> ### set up matrix to store the simulated scores
> scores <- matrix(rep(NA,n*maxNSessions),nrow=n)
> dat4 <- as.data.frame(cbind(ID,sessions,startScores,slopes,scores))
> for (i in 1:nrow(dat4)) {
> nSessions <- dat4$sessions[i]
> tmpScores <- rep(dat4$startScores[i],nSessions) # scores: no slopes
> tmpScores <- tmpScores + c(0,slopes[i]*(2:nSessions)) # scores: add linear effect
> tmpScores <- tmpScores + c(0,slopes[i]*((2:nSessions)^2))/10 # add quadratic effect
> tmpScores <- tmpScores + rnorm(nSessions,mean=0,sd=sqrt(errVar)) # now add error
> dat4[i,5:(4+nSessions)] <- tmpScores # put into data frame
> }
> ### now melt to create a long data frame
> longdat4 <- melt(dat4,id.vars="ID",measure.vars = 4:55)
> colnames(longdat4) <- c("validPOC_ID","wks","DEP_TOTAL")
> longdat4$session <- as.numeric(substr(longdat4$wks,2,3))-4 # minus 4 to get start at 1
> # as data start in column 5
> longdat4 <- longdat4[order(longdat4$validPOC_ID,longdat4$wks),]
> longdat4 <- na.omit(longdat4) # ditch the empty data
> ```
>
> So the data frame longdat4 has
> `r nrow(longdat4)` rows and
> `r ncol(longdat4)` columns and the column names are
> `r colnames(longdat4)`. Here is a bit of the head of the wide data frame (dat4) and the head of the long one (longdat4).
>
> ```{r head4,warning=FALSE,message=FALSE,cache=TRUE}
> require(knitr)
> kable(head(dat4[,1:10]),digits=3)
> kable(head(longdat4))
> ```
>
> Elisa's lattice plot.
>
> ```{r spaghetti4,warning=FALSE,message=FALSE,cache=TRUE}
> library(lattice)
> samp <- sample(unique(longdat4$validPOC_ID), 35)
> xyplot(DEP_TOTAL ~ wks|validPOC_ID, longdat4,
> type = c("p","g","r"),
> subset = (validPOC_ID %in% samp),
> col="dark blue",
> col.line="black",
> strip = F,
> xlab = "Weeks in treatment",
> ylab = "Depression Scores")
> ```
>
> Now Elisa's tangle plot. Just first 40 simulated cases.
>
> ```{r tangle4, warning=FALSE,message=FALSE,cache=TRUE}
> library(ggplot2)
> first40 <- longdat4[longdat4$validPOC_ID <=40,]
> tanglePlot <- ggplot(data = longdat4, aes(x = wks, y = DEP_TOTAL)) +
> geom_line(data = longdat4, alpha = 0.5, aes(group = validPOC_ID)) +
> theme_bw()
> tanglePlot
> ```
>
> ```{r model4, warning=FALSE, message=FALSE,cache=TRUE}
> require(lmerTest)
> ### this next fails to converge and has perfect correlation of linear & quadratic effects
> # m4 <- lmer(DEP_TOTAL ~ session+I(session^2)+(1+session+I(session^2)|validPOC_ID),
> # data=longdat4)
> ### is this better:
> summary(longdat4$session)
> hist(longdat4$session,
> main="Histogram of raw session counts",
> xlab="Raw session n",
> col="grey")
> m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID),
> data=longdat4)
> summary(m4)
> ```
>
> This next should test the random effect (and should find it highly significant I think).
>
> ```{r mode4rand,eval=FALSE}
> rand(m4)
> ### fails with message:
> ### Error in terms.formula(tmp, simplify = TRUE) :
> ### invalid model formula in ExtractVars
> ```
>
> And this summarises the random effects. These are deviations from the fixed effects so both are centred on zero.
>
> ```{r coeffs4,fig.width=11,fig.height=8,cache=TRUE}
> apply(ranef(m4)$validPOC_ID,2,summary)
> par(mfrow=c(1,3))
> hist(ranef(m4)$validPOC_ID[,1],
> xlab="Intercept deviation values",
> main="Random intercepts",
> col="grey")
> abline(v=0)
> hist(ranef(m4)$validPOC_ID[,2],
> main="Random slopes",
> xlab="Linear slope deviation values",
> col="grey")
> abline(v=0)
> hist(ranef(m4)$validPOC_ID[,3],
> main="Random slopes",
> xlab="Quadratic slope deviation values",
> col="grey")
> abline(v=0)
> ```
>
>
> # Simulation 5: random intercept and random slope with quadratic effect, CENTRED
>
> OK. Random intercept and slope has linear and quadratic effects. Now going to centre the session count by subtracting lengthCentre
>
>
> ```{r model5, warning=FALSE, message=FALSE}
> longdat4$sessionC1 <- longdat4$session - lengthCentre
> summary(longdat4$sessionC1)
> hist(longdat4$sessionC1,
> main="Histogram of centred session counts",
> xlab="Centred session n",
> col="grey")
> abline(v=0)
> require(lmerTest)
> m5 <- lmer(DEP_TOTAL ~ poly(sessionC1,2)+(1+poly(sessionC1,2)|validPOC_ID),
> data=longdat4)
> summary(m5)
> ```
>
>
>
> ```{r mode5rand,eval=FALSE}
> rand(m5)
> ### fails with message:
> ### Error in terms.formula(tmp, simplify = TRUE) :
> ### invalid model formula in ExtractVars
> ```
>
>
> ```{r coeffs5,fig.width=11,fig.height=8}
> apply(ranef(m5)$validPOC_ID,2,summary)
> par(mfrow=c(1,3))
> hist(ranef(m5)$validPOC_ID[,1],
> xlab="Intercept deviation values",
> main="Random intercepts",
> col="grey")
> abline(v=0)
> hist(ranef(m5)$validPOC_ID[,2],
> main="Random slopes",
> xlab="Linear slope deviation values",
> col="grey")
> abline(v=0)
> hist(ranef(m5)$validPOC_ID[,3],
> main="Random slopes",
> xlab="Quadratic slope deviation values",
> col="grey")
> abline(v=0)
> ```
>
> And here is sessionInfo()
>
>> sessionInfo()
> R version 3.4.1 (2017-06-30)
> Platform: x86_64-w64-mingw32/x64 (64-bit)
> Running under: Windows >= 8 x64 (build 9200)
>
> Matrix products: default
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] ggplot2_2.2.1 lattice_0.20-35 knitr_1.17 lmerTest_2.0-33 lme4_1.1-13
> [6] Matrix_1.2-11 reshape2_1.4.2
>
> loaded via a namespace (and not attached):
> [1] Rcpp_0.12.12 highr_0.6 RColorBrewer_1.1-2 compiler_3.4.1
> [5] nloptr_1.0.4 plyr_1.8.4 base64enc_0.1-3 tools_3.4.1
> [9] rpart_4.1-11 digest_0.6.12 checkmate_1.8.3 htmlTable_1.9
> [13] evaluate_0.10.1 tibble_1.3.4 nlme_3.1-131 gtable_0.2.0
> [17] rlang_0.1.2 yaml_2.1.14 gridExtra_2.3 cluster_2.0.6
> [21] stringr_1.2.0 htmlwidgets_0.9 nnet_7.3-12 rprojroot_1.2
> [25] grid_3.4.1 data.table_1.10.4 survival_2.41-3 foreign_0.8-69
> [29] rmarkdown_1.6 latticeExtra_0.6-28 minqa_1.2.4 Formula_1.2-2
> [33] magrittr_1.5 backports_1.1.0 Hmisc_4.0-3 scales_0.5.0
> [37] htmltools_0.3.6 MASS_7.3-47 splines_3.4.1 rsconnect_0.8.5
> [41] colorspace_1.3-2 labeling_0.3 stringi_1.1.5 acepack_1.4.1
> [45] lazyeval_0.2.0 munsell_0.4.3
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

-- 
Diana Michl, M.A.
PhD candidate
International Experimental
and Clinical Linguistics
Universität Potsdam
www.ling.uni-potsdam.de/staff/dmichl
www.duoinfernale.eu


	[[alternative HTML version deleted]]



More information about the R-sig-mixed-models mailing list