[R-sig-ME] Failing to understand effect of centering a predictor in multilevel model

Thu Sep 21 09:48:50 CEST 2017

I am trying to make sure I understand the use of multilevel modelling to look at change (on questionnaire scores) across sessions of psychotherapy. This is widely done in my applied field. I am a researcher (and ex-therapist) not a statistician and have no direct availability of a statistician but I'm fairly savvy about stats. I am simulating, rather crudely, data which have both fixed and random intercepts and slopes with a quadratic as well as a linear effect of session number. I'm using lmer from lme4 overloaded by lmerTest in 64 bit R 3.4.1 on Windoze running in Rstudio. 

The thing that's baffling me is that centering my predictor (session count) seems to have NO effect on my results, neither on the fixed nor the random effects. I would have expected the intercept to have changed at the very least given the effects I've simulated. 

Details: session count can range from 1 to 51 (the number of sessions per client in my simulation can vary, realistically, from 4 to 51 so no client has fewer than four). I have modelled in a random linear and quadratic effect of session count but such that the overall effects are both negative. Code and session info are at the end of the post. Here is code and output from uncentred: 

m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID), 
data=longdat4) 
summary(m4) 

## Linear mixed model fit by REML t-tests use Satterthwaite approximations 
## to degrees of freedom [lmerMod] 
## Formula: 
## DEP_TOTAL ~ poly(session, 2) + (1 + poly(session, 2) | validPOC_ID) 
## Data: longdat4 
## 
## REML criterion at convergence: -29507 
## 
## Scaled residuals: 
## Min 1Q Median 3Q Max 
## -3.6274 -0.6583 0.0035 0.6420 4.0849 
## 
## Random effects: 
## Groups Name Variance Std.Dev. Corr 
## validPOC_ID (Intercept) 9.309e+00 3.05101 
## poly(session, 2)1 7.390e+03 85.96686 0.20 
## poly(session, 2)2 4.154e+02 20.38232 0.21 1.00 
## Residual 4.973e-03 0.07052 
## Number of obs: 15399, groups: validPOC_ID, 500 
## 
## Fixed effects: 
## Estimate Std. Error df t value Pr(>|t|) 
## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 *** 
## poly(session, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 *** 
## poly(session, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 *** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Correlation of Fixed Effects: 
## (Intr) p(,2)1 
## ply(sss,2)1 0.210 
## ply(sss,2)2 0.215 0.995 

Here is the centered one (lengthCentre is 29.5). 

longdat4 $ sessionC1 <- longdat4 $ session - lengthCentre summary ( longdat4 $ sessionC1 ) 
## Min. 1st Qu. Median Mean 3rd Qu. Max. 
## -28.50 -21.50 -12.50 -10.09 -0.50 21.50 
m5 <- lmer ( DEP_TOTAL ~ poly ( sessionC1 , 2 ) + ( 1 + poly ( sessionC1 , 2 ) | validPOC_ID ) , data = longdat4 ) summary ( m5 ) 
## Linear mixed model fit by REML t-tests use Satterthwaite approximations 
## to degrees of freedom [lmerMod] 
## Formula: 
## DEP_TOTAL ~ poly(sessionC1, 2) + (1 + poly(sessionC1, 2) | validPOC_ID) 
## Data: longdat4 
## 
## REML criterion at convergence: -29507 
## 
## Scaled residuals: 
## Min 1Q Median 3Q Max 
## -3.6274 -0.6583 0.0035 0.6420 4.0849 
## 
## Random effects: 
## Groups Name Variance Std.Dev. Corr 
## validPOC_ID (Intercept) 9.309e+00 3.05101 
## poly(sessionC1, 2)1 7.390e+03 85.96686 0.20 
## poly(sessionC1, 2)2 4.154e+02 20.38232 0.21 1.00 
## Residual 4.973e-03 0.07052 
## Number of obs: 15399, groups: validPOC_ID, 500 
## 
## Fixed effects: 
## Estimate Std. Error df t value Pr(>|t|) 
## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 *** 
## poly(sessionC1, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 *** 
## poly(sessionC1, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 *** 
## --- 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Correlation of Fixed Effects: 
## (Intr) p(C1,2)1 
## ply(sC1,2)1 0.210 
## ply(sC1,2)2 0.215 0.995 

This makes no sense to me. I had expected that the very crude centering must shift the fixed intercept as the x=0 has shifted left by 29.5 and I have strong linear and quadratic effects in the model. I also thought (the whole reason I'm doing this) that it would change the correlation between the linear and quadratic terms as I should be moving from a strong positive correlation between them to a much smaller one. 

I'm sure I'm doing something utterly stupid, or revealing shocking failure to comprehend the multilevel model, but I'm looking straight through the explanation and stuck. I hope someone will help me! 

TIA, 

Chris 

Here is entire pertinent code. 

```{r sim4,warning=FALSE,message=FALSE,cache=TRUE} 
require(reshape2) 
set.seed(12345) 
### set up the parameters for the simulation 
n <- 500 
minNSessions <- 4 
maxNSessions <- 55 
linSlope <- -.02 
errVar <- .005 
### derived variables 
lengthRange <- minNSessions:maxNSessions 
lengthCentre <- (minNSessions+maxNSessions)/2 
ID <- 1:n 
### create a variable, length = n for the session counts 
sessions <- sample(lengthRange,n,replace=TRUE) 
### for now, give all starting scores of 10 
startScores <- rnorm(n,mean=10,sd=3) 
slopes <- rnorm(n,mean=linSlope,sd=.01) # random slopes 
### set up matrix to store the simulated scores 
scores <- matrix(rep(NA,n*maxNSessions),nrow=n) 
dat4 <- as.data.frame(cbind(ID,sessions,startScores,slopes,scores)) 
for (i in 1:nrow(dat4)) { 
nSessions <- dat4$sessions[i] 
tmpScores <- rep(dat4$startScores[i],nSessions) # scores: no slopes 
tmpScores <- tmpScores + c(0,slopes[i]*(2:nSessions)) # scores: add linear effect 
tmpScores <- tmpScores + c(0,slopes[i]*((2:nSessions)^2))/10 # add quadratic effect 
tmpScores <- tmpScores + rnorm(nSessions,mean=0,sd=sqrt(errVar)) # now add error 
dat4[i,5:(4+nSessions)] <- tmpScores # put into data frame 
} 
### now melt to create a long data frame 
longdat4 <- melt(dat4,id.vars="ID",measure.vars = 4:55) 
colnames(longdat4) <- c("validPOC_ID","wks","DEP_TOTAL") 
longdat4$session <- as.numeric(substr(longdat4$wks,2,3))-4 # minus 4 to get start at 1 
# as data start in column 5 
longdat4 <- longdat4[order(longdat4$validPOC_ID,longdat4$wks),] 
longdat4 <- na.omit(longdat4) # ditch the empty data 
``` 

So the data frame longdat4 has 
`r nrow(longdat4)` rows and 
`r ncol(longdat4)` columns and the column names are 
`r colnames(longdat4)`. Here is a bit of the head of the wide data frame (dat4) and the head of the long one (longdat4). 

```{r head4,warning=FALSE,message=FALSE,cache=TRUE} 
require(knitr) 
kable(head(dat4[,1:10]),digits=3) 
kable(head(longdat4)) 
``` 

Elisa's lattice plot. 

```{r spaghetti4,warning=FALSE,message=FALSE,cache=TRUE} 
library(lattice) 
samp <- sample(unique(longdat4$validPOC_ID), 35) 
xyplot(DEP_TOTAL ~ wks|validPOC_ID, longdat4, 
type = c("p","g","r"), 
subset = (validPOC_ID %in% samp), 
col="dark blue", 
col.line="black", 
strip = F, 
xlab = "Weeks in treatment", 
ylab = "Depression Scores") 
``` 

Now Elisa's tangle plot. Just first 40 simulated cases. 

```{r tangle4, warning=FALSE,message=FALSE,cache=TRUE} 
library(ggplot2) 
first40 <- longdat4[longdat4$validPOC_ID <=40,] 
tanglePlot <- ggplot(data = longdat4, aes(x = wks, y = DEP_TOTAL)) + 
geom_line(data = longdat4, alpha = 0.5, aes(group = validPOC_ID)) + 
theme_bw() 
tanglePlot 
``` 

```{r model4, warning=FALSE, message=FALSE,cache=TRUE} 
require(lmerTest) 
### this next fails to converge and has perfect correlation of linear & quadratic effects 
# m4 <- lmer(DEP_TOTAL ~ session+I(session^2)+(1+session+I(session^2)|validPOC_ID), 
# data=longdat4) 
### is this better: 
summary(longdat4$session) 
hist(longdat4$session, 
main="Histogram of raw session counts", 
xlab="Raw session n", 
col="grey") 
m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID), 
data=longdat4) 
summary(m4) 
``` 

This next should test the random effect (and should find it highly significant I think). 

```{r mode4rand,eval=FALSE} 
rand(m4) 
### fails with message: 
### Error in terms.formula(tmp, simplify = TRUE) : 
### invalid model formula in ExtractVars 
``` 

And this summarises the random effects. These are deviations from the fixed effects so both are centred on zero. 

```{r coeffs4,fig.width=11,fig.height=8,cache=TRUE} 
apply(ranef(m4)$validPOC_ID,2,summary) 
par(mfrow=c(1,3)) 
hist(ranef(m4)$validPOC_ID[,1], 
xlab="Intercept deviation values", 
main="Random intercepts", 
col="grey") 
abline(v=0) 
hist(ranef(m4)$validPOC_ID[,2], 
main="Random slopes", 
xlab="Linear slope deviation values", 
col="grey") 
abline(v=0) 
hist(ranef(m4)$validPOC_ID[,3], 
main="Random slopes", 
xlab="Quadratic slope deviation values", 
col="grey") 
abline(v=0) 
``` 

# Simulation 5: random intercept and random slope with quadratic effect, CENTRED 

OK. Random intercept and slope has linear and quadratic effects. Now going to centre the session count by subtracting lengthCentre 

```{r model5, warning=FALSE, message=FALSE} 
longdat4$sessionC1 <- longdat4$session - lengthCentre 
summary(longdat4$sessionC1) 
hist(longdat4$sessionC1, 
main="Histogram of centred session counts", 
xlab="Centred session n", 
col="grey") 
abline(v=0) 
require(lmerTest) 
m5 <- lmer(DEP_TOTAL ~ poly(sessionC1,2)+(1+poly(sessionC1,2)|validPOC_ID), 
data=longdat4) 
summary(m5) 
``` 

```{r mode5rand,eval=FALSE} 
rand(m5) 
### fails with message: 
### Error in terms.formula(tmp, simplify = TRUE) : 
### invalid model formula in ExtractVars 
``` 

```{r coeffs5,fig.width=11,fig.height=8} 
apply(ranef(m5)$validPOC_ID,2,summary) 
par(mfrow=c(1,3)) 
hist(ranef(m5)$validPOC_ID[,1], 
xlab="Intercept deviation values", 
main="Random intercepts", 
col="grey") 
abline(v=0) 
hist(ranef(m5)$validPOC_ID[,2], 
main="Random slopes", 
xlab="Linear slope deviation values", 
col="grey") 
abline(v=0) 
hist(ranef(m5)$validPOC_ID[,3], 
main="Random slopes", 
xlab="Quadratic slope deviation values", 
col="grey") 
abline(v=0) 
``` 

And here is sessionInfo() 

> sessionInfo() 
R version 3.4.1 (2017-06-30) 
Platform: x86_64-w64-mingw32/x64 (64-bit) 
Running under: Windows >= 8 x64 (build 9200) 

Matrix products: default 

locale: 
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C 
[5] LC_TIME=English_United Kingdom.1252 

attached base packages: 
[1] stats graphics grDevices utils datasets methods base 

other attached packages: 
[1] ggplot2_2.2.1 lattice_0.20-35 knitr_1.17 lmerTest_2.0-33 lme4_1.1-13 
[6] Matrix_1.2-11 reshape2_1.4.2 

loaded via a namespace (and not attached): 
[1] Rcpp_0.12.12 highr_0.6 RColorBrewer_1.1-2 compiler_3.4.1 
[5] nloptr_1.0.4 plyr_1.8.4 base64enc_0.1-3 tools_3.4.1 
[9] rpart_4.1-11 digest_0.6.12 checkmate_1.8.3 htmlTable_1.9 
[13] evaluate_0.10.1 tibble_1.3.4 nlme_3.1-131 gtable_0.2.0 
[17] rlang_0.1.2 yaml_2.1.14 gridExtra_2.3 cluster_2.0.6 
[21] stringr_1.2.0 htmlwidgets_0.9 nnet_7.3-12 rprojroot_1.2 
[25] grid_3.4.1 data.table_1.10.4 survival_2.41-3 foreign_0.8-69 
[29] rmarkdown_1.6 latticeExtra_0.6-28 minqa_1.2.4 Formula_1.2-2 
[33] magrittr_1.5 backports_1.1.0 Hmisc_4.0-3 scales_0.5.0 
[37] htmltools_0.3.6 MASS_7.3-47 splines_3.4.1 rsconnect_0.8.5 
[41] colorspace_1.3-2 labeling_0.3 stringi_1.1.5 acepack_1.4.1 
[45] lazyeval_0.2.0 munsell_0.4.3 
> 

	[[alternative HTML version deleted]]