[R-sig-ME] Failing to understand effect of centering a predictor in multilevel model

Thierry Onkelinx thierry.onkelinx at inbo.be
Thu Sep 21 15:27:17 CEST 2017


Dear Chris,

Collinearity is not an issue because poly() creates orthogonal
polynomials. I expect that the problem is with the distribution of the
session variable. Does the range of "session" depends on the subject?
In the example below, each is has a session "1", but very few have a
session "50". In this case the model has plenty of data for low values
of session but not for high values. Making it harder to fit a
quadratic term.

set.seed(1234)
n_id <- 100
n_session <- rnbinom(n_id, size = 1, prob = 0.06)
n_session <- pmax(1, pmin(50, n_session))
dataset <- do.call(
  rbind,
  lapply(
    seq_along(n_session),
    function(i){
      data.frame(id = i, session = seq_len(n_session[i]))
    }
  )
)
fixed <- matrix(runif(3), ncol = 1)
random <- matrix(runif(3 * n_id), nrow = n_id)

dataset$fixed <- as.vector(model.matrix(id ~ poly(session, 2), data =
dataset) %*% fixed)
dataset$random <- rowSums(model.matrix(id ~ poly(session, 2), data =
dataset) * random[dataset$id, ])
dataset$mu <- dataset$fixed + dataset$random
dataset$y <- rnorm(nrow(dataset), mean = dataset$mu, sd = 0.05)
library(ggplot2)
ggplot(dataset, aes(x = session, y = mu, group = id)) +
  geom_line()
ggplot(dataset, aes(x = session, y = y, group = id)) +
  geom_line()

library(lme4)
model <- lmer(y ~ poly(session, 2) + (poly(session, 2) | id), data = dataset)
summary(model)

Best regards,

ir. Thierry Onkelinx
Statisticus/ Statiscian

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx op inbo.be
Kliniekstraat 25, B-1070 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be
able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////


Van 14 tot en met 19 december 2017 verhuizen we uit onze vestiging in
Brussel naar het Herman Teirlinckgebouw op de site Thurn & Taxis.
Vanaf dan ben je welkom op het nieuwe adres: Havenlaan 88 bus 73, 1000 Brussel.

///////////////////////////////////////////////////////////////////////////////////////////



2017-09-21 12:22 GMT+02:00 Chris Evans <chrishold op psyctc.org>:
> Oh dear, many, many thanks Thierry,
>
> Of course.  I knew it must be something idiotic I was doing.
>
> I think what had thrown me off from seeing that was the correlations in the fixed and random estimates.  As I read them (see output in my original post), the correlation between the fixed estimates of the linear and quadratic terms came out as .995 and the correlation for the random effects as 1.0  I had, perhaps naively, assumed that with linear and quadratic effects built into the population model with x (sessions) all positive would have inevitable strong collinearity and that was what I was seeing in the raw run and that centering would reduce it.
>
> I think, in some stupid way, I had just forgotten that the use of poly() (instead of "+ session + I(session^2)" ?) was transforming things in a way that makes any linear transform of session irrelevent (as you nicely point out!)
>
> So, can I put three more short follow up questions here:
> 1) am I reading those correlations correctly?
> 2) if so, then clearly I've misunderstood poly(): I thought it would have created essentially uncorrelated predictors, and cor(cbind(x,x^2,poly(x,2)) from your example seems to fit with that.  I thought that would have reduced or removed the inevitable collinearity of a linear and a quadratic effect and correlations of .995 and 1.0 seem not very reduced!
> 3) can you or someone point me to reading on how to interpret the correlations as warnings of possibly misleading estimates?
>
> TIA,
>
> Chris
>
>
> ----- Original Message -----
>> From: "Thierry Onkelinx" <thierry.onkelinx op inbo.be>
>> To: "Chris Evans" <chrishold op psyctc.org>
>> Cc: r-sig-mixed-models op r-project.org
>> Sent: Thursday, 21 September, 2017 10:36:40
>> Subject: Re: [R-sig-ME] Failing to understand effect of centering a predictor in multilevel model
>
>> Dear Chris,
>>
>> This is because you use poly() on all variables
>>
>> x <- 10:20
>> cx <- x - mean(x)
>> all.equal(poly(x, 2), poly(cx, 2), check.attributes = FALSE)
>> all.equal(poly(x, 2), poly(cx, 2))
>>
>> Best regards,
>>
>> ir. Thierry Onkelinx
>> Statisticus/ Statiscian
>>
>> Vlaamse Overheid / Government of Flanders
>> INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
>> AND FOREST
>> Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
>> thierry.onkelinx op inbo.be
>> Kliniekstraat 25, B-1070 Brussel
>> www.inbo.be
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>> To call in the statistician after the experiment is done may be no
>> more than asking him to perform a post-mortem examination: he may be
>> able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
>> The plural of anecdote is not data. ~ Roger Brinner
>> The combination of some data and an aching desire for an answer does
>> not ensure that a reasonable answer can be extracted from a given body
>> of data. ~ John Tukey
>> ///////////////////////////////////////////////////////////////////////////////////////////
>>
>>
>> Van 14 tot en met 19 december 2017 verhuizen we uit onze vestiging in
>> Brussel naar het Herman Teirlinckgebouw op de site Thurn & Taxis.
>> Vanaf dan ben je welkom op het nieuwe adres: Havenlaan 88 bus 73, 1000 Brussel.
>>
>> ///////////////////////////////////////////////////////////////////////////////////////////
>>
>>
>>
>> 2017-09-21 11:29 GMT+02:00 Chris Evans <chrishold op psyctc.org>:
>>> I am trying to make sure I understand the use of multilevel modelling to look at
>>> change (on questionnaire scores) across sessions of psychotherapy. This is
>>> widely done in my applied field. I am a researcher (and ex-therapist) not a
>>> statistician and have no direct availability of a statistician but I'm fairly
>>> savvy about stats. I am simulating, rather crudely, data which have both fixed
>>> and random intercepts and slopes with a quadratic as well as a linear effect of
>>> session number. I'm using lmer from lme4 overloaded by lmerTest in 64 bit R
>>> 3.4.1 on Windoze running in Rstudio.
>>>
>>> The thing that's baffling me is that centering my predictor (session count)
>>> seems to have NO effect on my results, neither on the fixed nor the random
>>> effects. I would have expected the intercept to have changed at the very least
>>> given the effects I've simulated.
>>>
>>> Details: session count can range from 1 to 51 (the number of sessions per client
>>> in my simulation can vary, realistically, from 4 to 51 so no client has fewer
>>> than four). I have modelled in a random linear and quadratic effect of session
>>> count but such that the overall effects are both negative. Code and session
>>> info are at the end of the post. Here is code and output from uncentred:
>>>
>>> m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID),
>>> data=longdat4)
>>> summary(m4)
>>>
>>> ## Linear mixed model fit by REML t-tests use Satterthwaite approximations
>>> ## to degrees of freedom [lmerMod]
>>> ## Formula:
>>> ## DEP_TOTAL ~ poly(session, 2) + (1 + poly(session, 2) | validPOC_ID)
>>> ## Data: longdat4
>>> ##
>>> ## REML criterion at convergence: -29507
>>> ##
>>> ## Scaled residuals:
>>> ## Min 1Q Median 3Q Max
>>> ## -3.6274 -0.6583 0.0035 0.6420 4.0849
>>> ##
>>> ## Random effects:
>>> ## Groups Name Variance Std.Dev. Corr
>>> ## validPOC_ID (Intercept) 9.309e+00 3.05101
>>> ## poly(session, 2)1 7.390e+03 85.96686 0.20
>>> ## poly(session, 2)2 4.154e+02 20.38232 0.21 1.00
>>> ## Residual 4.973e-03 0.07052
>>> ## Number of obs: 15399, groups: validPOC_ID, 500
>>> ##
>>> ## Fixed effects:
>>> ## Estimate Std. Error df t value Pr(>|t|)
>>> ## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 ***
>>> ## poly(session, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 ***
>>> ## poly(session, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 ***
>>> ## ---
>>> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>> ##
>>> ## Correlation of Fixed Effects:
>>> ## (Intr) p(,2)1
>>> ## ply(sss,2)1 0.210
>>> ## ply(sss,2)2 0.215 0.995
>>>
>>> Here is the centered one (lengthCentre is 29.5).
>>>
>>> longdat4 $ sessionC1 <- longdat4 $ session - lengthCentre summary ( longdat4 $
>>> sessionC1 )
>>> ## Min. 1st Qu. Median Mean 3rd Qu. Max.
>>> ## -28.50 -21.50 -12.50 -10.09 -0.50 21.50
>>> m5 <- lmer ( DEP_TOTAL ~ poly ( sessionC1 , 2 ) + ( 1 + poly ( sessionC1 , 2 ) |
>>> validPOC_ID ) , data = longdat4 ) summary ( m5 )
>>> ## Linear mixed model fit by REML t-tests use Satterthwaite approximations
>>> ## to degrees of freedom [lmerMod]
>>> ## Formula:
>>> ## DEP_TOTAL ~ poly(sessionC1, 2) + (1 + poly(sessionC1, 2) | validPOC_ID)
>>> ## Data: longdat4
>>> ##
>>> ## REML criterion at convergence: -29507
>>> ##
>>> ## Scaled residuals:
>>> ## Min 1Q Median 3Q Max
>>> ## -3.6274 -0.6583 0.0035 0.6420 4.0849
>>> ##
>>> ## Random effects:
>>> ## Groups Name Variance Std.Dev. Corr
>>> ## validPOC_ID (Intercept) 9.309e+00 3.05101
>>> ## poly(sessionC1, 2)1 7.390e+03 85.96686 0.20
>>> ## poly(sessionC1, 2)2 4.154e+02 20.38232 0.21 1.00
>>> ## Residual 4.973e-03 0.07052
>>> ## Number of obs: 15399, groups: validPOC_ID, 500
>>> ##
>>> ## Fixed effects:
>>> ## Estimate Std. Error df t value Pr(>|t|)
>>> ## (Intercept) 8.6192 0.1367 526.0000 63.07 <2e-16 ***
>>> ## poly(sessionC1, 2)1 -179.5508 3.9684 4712.0000 -45.25 <2e-16 ***
>>> ## poly(sessionC1, 2)2 -42.1333 0.9497 4029.0000 -44.37 <2e-16 ***
>>> ## ---
>>> ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>> ##
>>> ## Correlation of Fixed Effects:
>>> ## (Intr) p(C1,2)1
>>> ## ply(sC1,2)1 0.210
>>> ## ply(sC1,2)2 0.215 0.995
>>>
>>> This makes no sense to me. I had expected that the very crude centering must
>>> shift the fixed intercept as the x=0 has shifted left by 29.5 and I have strong
>>> linear and quadratic effects in the model. I also thought (the whole reason I'm
>>> doing this) that it would change the correlation between the linear and
>>> quadratic terms as I should be moving from a strong positive correlation
>>> between them to a much smaller one.
>>>
>>> I'm sure I'm doing something utterly stupid, or revealing shocking failure to
>>> comprehend the multilevel model, but I'm looking straight through the
>>> explanation and stuck. I hope someone will help me!
>>>
>>> TIA,
>>>
>>>
>>> Chris
>>>
>>>
>>>
>>> Here is entire pertinent code.
>>>
>>> ```{r sim4,warning=FALSE,message=FALSE,cache=TRUE}
>>> require(reshape2)
>>> set.seed(12345)
>>> ### set up the parameters for the simulation
>>> n <- 500
>>> minNSessions <- 4
>>> maxNSessions <- 55
>>> linSlope <- -.02
>>> errVar <- .005
>>> ### derived variables
>>> lengthRange <- minNSessions:maxNSessions
>>> lengthCentre <- (minNSessions+maxNSessions)/2
>>> ID <- 1:n
>>> ### create a variable, length = n for the session counts
>>> sessions <- sample(lengthRange,n,replace=TRUE)
>>> ### for now, give all starting scores of 10
>>> startScores <- rnorm(n,mean=10,sd=3)
>>> slopes <- rnorm(n,mean=linSlope,sd=.01) # random slopes
>>> ### set up matrix to store the simulated scores
>>> scores <- matrix(rep(NA,n*maxNSessions),nrow=n)
>>> dat4 <- as.data.frame(cbind(ID,sessions,startScores,slopes,scores))
>>> for (i in 1:nrow(dat4)) {
>>> nSessions <- dat4$sessions[i]
>>> tmpScores <- rep(dat4$startScores[i],nSessions) # scores: no slopes
>>> tmpScores <- tmpScores + c(0,slopes[i]*(2:nSessions)) # scores: add linear
>>> effect
>>> tmpScores <- tmpScores + c(0,slopes[i]*((2:nSessions)^2))/10 # add quadratic
>>> effect
>>> tmpScores <- tmpScores + rnorm(nSessions,mean=0,sd=sqrt(errVar)) # now add error
>>> dat4[i,5:(4+nSessions)] <- tmpScores # put into data frame
>>> }
>>> ### now melt to create a long data frame
>>> longdat4 <- melt(dat4,id.vars="ID",measure.vars = 4:55)
>>> colnames(longdat4) <- c("validPOC_ID","wks","DEP_TOTAL")
>>> longdat4$session <- as.numeric(substr(longdat4$wks,2,3))-4 # minus 4 to get
>>> start at 1
>>> # as data start in column 5
>>> longdat4 <- longdat4[order(longdat4$validPOC_ID,longdat4$wks),]
>>> longdat4 <- na.omit(longdat4) # ditch the empty data
>>> ```
>>>
>>> So the data frame longdat4 has
>>> `r nrow(longdat4)` rows and
>>> `r ncol(longdat4)` columns and the column names are
>>> `r colnames(longdat4)`. Here is a bit of the head of the wide data frame (dat4)
>>> and the head of the long one (longdat4).
>>>
>>> ```{r head4,warning=FALSE,message=FALSE,cache=TRUE}
>>> require(knitr)
>>> kable(head(dat4[,1:10]),digits=3)
>>> kable(head(longdat4))
>>> ```
>>>
>>> Elisa's lattice plot.
>>>
>>> ```{r spaghetti4,warning=FALSE,message=FALSE,cache=TRUE}
>>> library(lattice)
>>> samp <- sample(unique(longdat4$validPOC_ID), 35)
>>> xyplot(DEP_TOTAL ~ wks|validPOC_ID, longdat4,
>>> type = c("p","g","r"),
>>> subset = (validPOC_ID %in% samp),
>>> col="dark blue",
>>> col.line="black",
>>> strip = F,
>>> xlab = "Weeks in treatment",
>>> ylab = "Depression Scores")
>>> ```
>>>
>>> Now Elisa's tangle plot. Just first 40 simulated cases.
>>>
>>> ```{r tangle4, warning=FALSE,message=FALSE,cache=TRUE}
>>> library(ggplot2)
>>> first40 <- longdat4[longdat4$validPOC_ID <=40,]
>>> tanglePlot <- ggplot(data = longdat4, aes(x = wks, y = DEP_TOTAL)) +
>>> geom_line(data = longdat4, alpha = 0.5, aes(group = validPOC_ID)) +
>>> theme_bw()
>>> tanglePlot
>>> ```
>>>
>>> ```{r model4, warning=FALSE, message=FALSE,cache=TRUE}
>>> require(lmerTest)
>>> ### this next fails to converge and has perfect correlation of linear &
>>> quadratic effects
>>> # m4 <- lmer(DEP_TOTAL ~
>>> session+I(session^2)+(1+session+I(session^2)|validPOC_ID),
>>> # data=longdat4)
>>> ### is this better:
>>> summary(longdat4$session)
>>> hist(longdat4$session,
>>> main="Histogram of raw session counts",
>>> xlab="Raw session n",
>>> col="grey")
>>> m4 <- lmer(DEP_TOTAL ~ poly(session,2)+(1+poly(session,2)|validPOC_ID),
>>> data=longdat4)
>>> summary(m4)
>>> ```
>>>
>>> This next should test the random effect (and should find it highly significant I
>>> think).
>>>
>>> ```{r mode4rand,eval=FALSE}
>>> rand(m4)
>>> ### fails with message:
>>> ### Error in terms.formula(tmp, simplify = TRUE) :
>>> ### invalid model formula in ExtractVars
>>> ```
>>>
>>> And this summarises the random effects. These are deviations from the fixed
>>> effects so both are centred on zero.
>>>
>>> ```{r coeffs4,fig.width=11,fig.height=8,cache=TRUE}
>>> apply(ranef(m4)$validPOC_ID,2,summary)
>>> par(mfrow=c(1,3))
>>> hist(ranef(m4)$validPOC_ID[,1],
>>> xlab="Intercept deviation values",
>>> main="Random intercepts",
>>> col="grey")
>>> abline(v=0)
>>> hist(ranef(m4)$validPOC_ID[,2],
>>> main="Random slopes",
>>> xlab="Linear slope deviation values",
>>> col="grey")
>>> abline(v=0)
>>> hist(ranef(m4)$validPOC_ID[,3],
>>> main="Random slopes",
>>> xlab="Quadratic slope deviation values",
>>> col="grey")
>>> abline(v=0)
>>> ```
>>>
>>>
>>> # Simulation 5: random intercept and random slope with quadratic effect, CENTRED
>>>
>>> OK. Random intercept and slope has linear and quadratic effects. Now going to
>>> centre the session count by subtracting lengthCentre
>>>
>>>
>>> ```{r model5, warning=FALSE, message=FALSE}
>>> longdat4$sessionC1 <- longdat4$session - lengthCentre
>>> summary(longdat4$sessionC1)
>>> hist(longdat4$sessionC1,
>>> main="Histogram of centred session counts",
>>> xlab="Centred session n",
>>> col="grey")
>>> abline(v=0)
>>> require(lmerTest)
>>> m5 <- lmer(DEP_TOTAL ~ poly(sessionC1,2)+(1+poly(sessionC1,2)|validPOC_ID),
>>> data=longdat4)
>>> summary(m5)
>>> ```
>>>
>>>
>>>
>>> ```{r mode5rand,eval=FALSE}
>>> rand(m5)
>>> ### fails with message:
>>> ### Error in terms.formula(tmp, simplify = TRUE) :
>>> ### invalid model formula in ExtractVars
>>> ```
>>>
>>>
>>> ```{r coeffs5,fig.width=11,fig.height=8}
>>> apply(ranef(m5)$validPOC_ID,2,summary)
>>> par(mfrow=c(1,3))
>>> hist(ranef(m5)$validPOC_ID[,1],
>>> xlab="Intercept deviation values",
>>> main="Random intercepts",
>>> col="grey")
>>> abline(v=0)
>>> hist(ranef(m5)$validPOC_ID[,2],
>>> main="Random slopes",
>>> xlab="Linear slope deviation values",
>>> col="grey")
>>> abline(v=0)
>>> hist(ranef(m5)$validPOC_ID[,3],
>>> main="Random slopes",
>>> xlab="Quadratic slope deviation values",
>>> col="grey")
>>> abline(v=0)
>>> ```
>>>
>>> And here is sessionInfo()
>>>
>>>> sessionInfo()
>>> R version 3.4.1 (2017-06-30)
>>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>> Running under: Windows >= 8 x64 (build 9200)
>>>
>>> Matrix products: default
>>>
>>> locale:
>>> [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
>>> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
>>> [5] LC_TIME=English_United Kingdom.1252
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] ggplot2_2.2.1 lattice_0.20-35 knitr_1.17 lmerTest_2.0-33 lme4_1.1-13
>>> [6] Matrix_1.2-11 reshape2_1.4.2
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Rcpp_0.12.12 highr_0.6 RColorBrewer_1.1-2 compiler_3.4.1
>>> [5] nloptr_1.0.4 plyr_1.8.4 base64enc_0.1-3 tools_3.4.1
>>> [9] rpart_4.1-11 digest_0.6.12 checkmate_1.8.3 htmlTable_1.9
>>> [13] evaluate_0.10.1 tibble_1.3.4 nlme_3.1-131 gtable_0.2.0
>>> [17] rlang_0.1.2 yaml_2.1.14 gridExtra_2.3 cluster_2.0.6
>>> [21] stringr_1.2.0 htmlwidgets_0.9 nnet_7.3-12 rprojroot_1.2
>>> [25] grid_3.4.1 data.table_1.10.4 survival_2.41-3 foreign_0.8-69
>>> [29] rmarkdown_1.6 latticeExtra_0.6-28 minqa_1.2.4 Formula_1.2-2
>>> [33] magrittr_1.5 backports_1.1.0 Hmisc_4.0-3 scales_0.5.0
>>> [37] htmltools_0.3.6 MASS_7.3-47 splines_3.4.1 rsconnect_0.8.5
>>> [41] colorspace_1.3-2 labeling_0.3 stringi_1.1.5 acepack_1.4.1
>>> [45] lazyeval_0.2.0 munsell_0.4.3
>>>>
>>>
>>>
>>>         [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> R-sig-mixed-models op r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
> _______________________________________________
> R-sig-mixed-models op r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models



More information about the R-sig-mixed-models mailing list