[R-sig-ME] need help with predicted values from lmer with newdata

Joshua Wiley jwiley.psych at gmail.com
Fri Jun 22 11:19:29 CEST 2012


Hi Paul,

I think this may be a bug in predict.merMod.  It is not terribly
elegant, but I suggest changing line 102 from:

re_new <- unlist(re_List[m])

to

re_new <- as.vector(do.call("rbind", re_List[m]))

I believe the issue with the current code is that the random effects are
unlisted, which essentially 'stacks' the vector level by level (the
intercept and slope for level 1, then for level 2, and so on) rather
than effect by effect (all the intercepts, then all the slopes).  This
vector is later post-multiplied by a matrix that assumes the
effect-by-effect ordering; the dimensions are correct, so the matrices
multiply fine, but the results are not what I believe was intended.
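A toy illustration of the ordering difference (the names here are made
up, just mimicking a list of per-level random-effect vectors; this is a
sketch, not the actual predict.merMod internals):

```r
## Hypothetical stand-in for re_List: one named vector of random
## effects (intercept, x1 slope) per grouping level.
re_List <- list(g1 = c("(Intercept)" = 1, x1 = 10),
                g2 = c("(Intercept)" = 2, x1 = 20))
m <- c("g1", "g2")

## unlist() interleaves: intercept and slope for g1, then for g2.
unlist(re_List[m])
## -> 1 10 2 20

## rbind() then as.vector() groups by effect: both intercepts,
## then both slopes (as.vector reads the matrix column-major).
as.vector(do.call("rbind", re_List[m]))
## -> 1 2 10 20
```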

I cannot see a way to add a patch on R-forge or to submit a bug report,
so I have CCed Doug on this.

Cheers,

Josh

On Thu, Jun 21, 2012 at 8:28 PM, Paul Johnson <pauljohn32 at gmail.com> wrote:
> Greetings.
>
> I'm getting some crazy results out of predict on an lmer model.  I've
> gone through
> this lots of times, and feel certain I'm missing some obvious mistake.
>  If you browse
> through and get to the bottom, there's some output that shows my predict output
> is just wrong, but I can't see why.
>
> I will send the first person with the right answer a new KU t-shirt in
> the size of
> your choice.  That's better than "thanks in advance". Isn't it?
>
> ## Paul Johnson <pauljohn at ku.edu>
> ## 2012-06-21
>
> ## I want to compare the "separate ols regressions on clusters"
> ## results with the mixed model estimates. I've run into
> ## trouble because the output of predict with a newdata object
> ## seems to be completely wrong.  I demonstrate "manual" calculations
> ## to try to convince you I understand what's going on (but
> ## peculiarity of predict output may make things appear otherwise).
>
> ## I pasted in some output at the bottom in case you just want to
> ## see where I think there is a problem.
>
> ## My system:
> ## > sessionInfo()
> ## R version 2.15.0 (2012-03-30)
> ## Platform: x86_64-pc-linux-gnu (64-bit)
>
> ## locale:
> ##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
> ##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
> ##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
> ##  [7] LC_PAPER=C                 LC_NAME=C
> ##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> ## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> ## attached base packages:
> ## [1] stats     graphics  grDevices utils     datasets  methods   base
>
> ## other attached packages:
> ## [1] lme4_0.999902344-0 Matrix_1.0-6       lattice_0.20-6     MASS_7.3-18
> ## loaded via a namespace (and not attached):
> ## [1] compiler_2.15.0 grid_2.15.0     minqa_1.2.1     nlme_3.1-104
> ## [5] splines_2.15.0  tools_2.15.0
>
>
> ## Topic: create multi-level data and use regression and lmer to
> ## estimate it.  This creates 3 different dependent variables,
> ## y1: ordinary "homoskedastic" regression (no grouping effect)
> ## y2: clustered "random intercepts"
> ## y3: clustered "random intercepts" and "random slope" for 1 predictor (x1)
>
> library(MASS)
> set.seed(1234)
>
> ## Step 1. Create a Data Frame.
>
> ## M respondents, N observations for each one.
> ## In a repeated measures context, this is called "longitudinal data".
> ## In a cross-sectional approach, this is M groups (classrooms) with
> ## several people in each.
>
> M <- 10
> N <- 30
>
> ## get M unique gray colors, will use later for plotting
> grays <- gray.colors(M)
>
> ## Standard deviation of error term at individual level
> STDE <- 48
>
> ## STDEM: standard deviation of clustered intercepts.
> ## In a longitudinal "repeated measures" exercise, this is an
> ## individual-level effect. In a cross section, this is a random
> ## classroom effect.
> STDEM <- 30
>
> ## STDEx1: standard deviation of slopes across cluster units
> STDEx1 <- 5
>
> ## The true effects of b0, b1, b2, b3 in
> ## y = b0 + b1*x1 + b2*x2 + b3*x3
> bslopes <- c(0.2, 15.4, -0.2, 3)
>
> ## Now generate the data frame with x1, x2 and x3.
> ## Let's suppose the predictors are Multivariate Normal, with
> ## means (100,200,150), standard deviations (10, 20, 30), and
> ## intercorrelations of 0.4.  These can, of course, be adjusted
> ## for interesting variations.
>
> ## Mind: an "indicator" that says to which cluster an observation belongs.
> Mind <- 1:M %x% rep(1,N)
>
> means <- c(100, 200, 150)
> sds <- c(10, 20, 30)
> rho <- 0.4
> corr.mat <- matrix(c(1, rho, rho, rho, 1, rho, rho, rho, 1), nrow = 3)
> sigma <- diag(sds) %*% corr.mat %*% diag(sds)
> x.mat <- mvrnorm(n = N * M, mu = means, Sigma = sigma)
> dimnames(x.mat)[[2]] <- c("x1", "x2", "x3")
> ## Add an intercept on the front of x.mat
> x.mat <- cbind("(Intercept)"= 1, x.mat)
>
>
> ## Create a dependent variable that has no clustering effects.  This
> ## is an "ordinary" regression with the slopes designated above plus
> ## random noise.
> y1 <- x.mat %*% bslopes + rnorm(M*N, m=0, s= STDE)
> dat <- data.frame(id=1:(N*M), Mind, y1, as.data.frame(x.mat))
> rm(Mind, y1, x.mat) ## cleanup workspace
>
> ## Layer on additive group level error in y2
> ## Add a group-level intercept.
> ## A vector of M disturbances is created, and each one is
> ## added to all N cases within its cluster.
> dat$y2 <- dat$y1 +  rnorm(M, 0, STDEM) %x% rep(1, N)
>
> ## In y3, add in a random slope effect.
> ## We are changing the slope on x1 from b1 to N(b1, STDEx1^2),
> ## i.e., the new slope is newb1 = b1 + N(0, STDEx1).
> dat$y3 <- dat$y2 +   (rnorm(M, 0, STDEx1) %x% rep(1, N)) * dat$x1
>
> ## In the lme4 package, there is an "easy" tool to run M separate lm regressions
> library(lme4)
>
> ## M separate regressions (here M = 10), each with 3 predictors
> m3list <- lmList(y3 ~ x1 + x2 + x3 | Mind, data=dat, pool = FALSE)
>
>
> ## Now I see a pattern. I'll use the cluster-specific range of x1 for
> ## each cluster; I think that makes the lines look more "in" the data.
> plot(y3 ~ x1, data=dat, col=grays[Mind], main = "lm on clusters")
> for( i in seq_along(m3list)){
>    m3mf <- model.frame(m3list[[i]]) #data set for group i
>    x1range <- range(m3mf$x1) ## use group-specific ranges this time
>    pgroup <- predict( m3list[[i]], newdata = data.frame(x1=x1range,
> x2=mean(m3mf$x2), x3=mean(m3mf$x3)))
>    lines(x1range, pgroup, col=grays[i])
> }
>
> ## Hmm. That data looks like it might be handy for comparison later on.
> ## Build a data frame of it. I've never thought of doing this work
> ## cluster-by-cluster before.
>
> m3newdat <- lapply(m3list, function(x) {
>    m3mf <- model.frame(x)
>    ndf = data.frame(x1=range(m3mf$x1), x2=mean(m3mf$x2), x3=mean(m3mf$x3))
>    ndf$m3pred <- predict(x, newdata = ndf)
>    ndf} )
> ## Smash the list of data frames together
> m3newdat <- do.call("rbind", m3newdat)
> ## Interesting!
> m3newdat[1:20, ]
> ## Better add a variable Mind. This looks stupid, but works.
> m3newdat$Mind <- as.integer( do.call("rbind",
> strsplit(row.names(m3newdat), split="\\.", perl=T))[ ,1] )
>
>
> ## Draw new graphs on a new device, so we can compare
> dev.new()
>
> ## The "right" model allows the random effects to be correlated.
> mm4 <- lmer( y3 ~ x1 + x2 + x3 + (x1 | Mind), data=dat)
> summary(mm4)
>
>
>
> mm4ranef <- ranef(mm4, postVar = TRUE) ## a ranef.mer object, a list
> ## that includes one data frame
>
>
> ## What do the predicted values mean from lmer?
> ## Here they are. But what do they include?
> head(mm4pred <- predict(mm4))
> tail(mm4pred <- predict(mm4))
>
>
> ## I can't tell what those numbers are. Can't be sure from ?predict.merMod
> ## So I'm going to go through the step-by-step
> ## process of calculating predicted values. These should follow a
> ## formula like this
> ## for model 4. j is the "cluster" index.
> ##
> ##  j'th intercept                j'th slope
> ##  (b0 + raneff(intercept, j)) + (b1 + raneff(x1,j))*x1 + b2 x2 + b3 x3
>
> mm4b <- coef(summary(mm4))[ ,1]
> mm4vnames <- names(mm4b)
> mm4mm <- model.matrix(mm4) ## predictors
>
> ## Doing this bit by bit. First
> ## b0 + raneff(intercept, j)
> mm4inteffect <- mm4b["(Intercept)"] +  mm4ranef[[1]][dat$Mind, 1]
> mm4x1effect <- mm4mm[ , c("x1")] * (mm4b["x1"] + mm4ranef[[1]][dat$Mind, 2])
> mm4pred2 <- mm4inteffect + mm4x1effect +  mm4mm[ ,c("x2","x3") ] %*%
> mm4b[c("x2","x3")]
> head(mm4pred2)
> tail(mm4pred2)
>
> ## Aha! Those exactly match predict for mm4. So I understand what "predict" does
>
>
>
>
> ## Now, I want to run predict for some particular values of x1, x2, x3.
>
> ## Let's try running m3newdat through the predict function for mm4.
> m3newdat$mm4.3 <- predict(mm4, newdata = m3newdat)
> m3newdat
>
>
> ## Disaster. Predicted values completely out of whack. Lots of
> ## lines fall off the bottom of the graph.
>
> plot(y3 ~ x1, data=dat, col=grays[Mind], main="lmer mixed model predictions")
> by(m3newdat, m3newdat$Mind, function(x) { lines(x$x1, x$mm4.3,
> col=grays[x$Mind])})
>
>
> ## They don't match the manual calculations I perform here
> mm4b <- coef(summary(mm4))[ ,1]
> mm4vnames <- names(mm4b)
> mm4mm <- as.matrix(m3newdat)
>
> ##  b0 + raneff(intercept, j)
> mm4inteffect <- mm4b["(Intercept)"] + rep(mm4ranef[[1]][, 1], each=2)
> mm4x1effect <- mm4mm[ , c("x1")] * rep(mm4b["x1"] + mm4ranef[[1]][, 2], each=2)
> mm4manualpred2 <- mm4inteffect + mm4x1effect + mm4mm[ , c("x2","x3")] %*%
>   mm4b[c("x2","x3")]
>
> mm4mm <- cbind(as.data.frame(mm4mm), "mm4manualpred" = mm4manualpred2)
> mm4mm
>
> ## Here's what I see when I do that. m3pred: the lmList predictions.
> ## mm4manualpred: my "manual" calculations of predicted values for lmer mm4
> ## mm4.3 output from predict for same "newdat" used with mm4manualpred.
>
> ## > mm4mm
> ##             x1       x2       x3   m3pred Mind      mm4.3 mm4manualpred
> ## 1.1   88.51074 196.0615 141.3859 1221.477    1  -434.7602      1239.152
> ## 1.2  110.88258 196.0615 141.3859 1483.957    1  -630.0463      1466.961
> ## 2.1   80.26167 194.1252 133.7265 1405.672    2  -614.3256      1388.751
> ## 2.2  117.08533 194.1252 133.7265 1845.499    2 -1051.4105      1866.505
> ## 3.1   80.84091 202.4517 154.3976 1302.205    3  3527.8154      1284.706
> ## 3.2  120.33775 202.4517 154.3976 1692.819    3  5037.6392      1717.210
> ## 4.1   68.96320 204.5145 151.6662 1321.927    4  5097.5261      1379.251
> ## 4.2  120.17029 204.5145 151.6662 2137.589    4  8559.2362      2108.628
> ## 5.1   84.85202 200.7799 153.1595 1821.135    5  4736.2240      1813.415
> ## 5.2  122.58730 200.7799 153.1595 2418.381    5  6641.2026      2426.855
> ## 6.1   84.60214 202.3324 153.2583 2015.183    6  1521.6831      2001.700
> ## 6.2  128.13418 202.3324 153.2583 2778.684    6  2086.4726      2799.053
> ## 7.1   74.73548 198.5761 155.8984 1780.196    7  1497.6761      1802.381
> ## 7.2  118.97636 198.5761 155.8984 2622.083    7  2127.8282      2603.626
> ## 8.1   83.40882 200.4991 152.0515 2342.043    8  1954.4592      2323.477
> ## 8.2  118.73997 200.4991 152.0515 3088.686    8  2601.6007      3105.182
> ## 9.1   82.48090 205.4736 152.2774 1979.852    9  2253.5650      1950.388
> ## 9.2  117.48279 205.4736 152.2774 2554.581    9  3027.9842      2587.319
> ## 10.1  84.77764 205.9689 152.3325 2256.434   10  2236.3346      2268.620
> ## 10.2 119.66417 205.9689 152.3325 3022.053   10  2980.1202      3012.406
>
>
> --
> Paul E. Johnson
> Professor, Political Science    Assoc. Director
> 1541 Lilac Lane, Room 504     Center for Research Methods
> University of Kansas               University of Kansas
> http://pj.freefaculty.org            http://quant.ku.edu
>
> _______________________________________________
> R-sig-mixed-models at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/


