[R-sig-ME] need help with predicted values from lmer with newdata

Fri Jun 22 05:28:06 CEST 2012

Greetings.

I'm getting some crazy results out of predict on an lmer model.  I've
gone through
this lots of times, and feel certain I'm missing some obvious mistake.
 If you browse
through and get to the bottom, there's some output that shows my predict output
is just wrong, but I can't see why.

I will send the first person with the right answer a new KU t-shirt in
the size of
your choice.  That's better than "thanks in advance". Isn't it?

## Paul Johnson <pauljohn at ku.edu>
## 2012-06-21

## I want to compare the "separate ols regressions on clusters"
## results with the mixed model estimates. I've run into
## trouble because the output of predict with a newdata object
## seems to be completely wrong.  I demonstrate "manual" calculations
## to try to convince you I understand what's going on (but
## peculiarity of predict output may make things appear otherwise).

## I pasted in some output at the bottom in case you just want to
## see where I think there is a problem.

## My system:
## > sessionInfo()
## R version 2.15.0 (2012-03-30)
## Platform: x86_64-pc-linux-gnu (64-bit)

## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=C                 LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base

## other attached packages:
## [1] lme4_0.999902344-0 Matrix_1.0-6       lattice_0.20-6     MASS_7.3-18
## loaded via a namespace (and not attached):
## [1] compiler_2.15.0 grid_2.15.0     minqa_1.2.1     nlme_3.1-104
## [5] splines_2.15.0  tools_2.15.0

## Topic: create multi-level data and use regression and lmer to
## estimate it.  This creates 3 different dependent variables,
## y1: ordinary "homoskedastic" regression (no grouping effect)
## y2: clustered "random intercepts"
## y3: clustered "random intercepts" and "random slope" for 1 predictor (x1)

library(MASS)
set.seed(1234)

## Step 1. Create a Data Frame.

## M respondents, N observations for each one.
## In a repeated measures context, this is called "longitudinal data".
## In cross sectional approach, this is M groups (classrooms) with
## several people in each)

M <- 10
N <- 30

## get M unique gray colors, will use later for plotting
grays <- gray.colors(M)

## Standard deviation of error term at individual level
STDE <- 48

## STDEM: standard deviation of clustered intercepts.
### In a longitudinal "repeated measures" exercise, this is an
## individual-level effect. In a cross section, this is a random
## classroom effect.
STDEM <- 30

## STEx1: standard deviation of slopes across cluster units
STDEx1 <- 5

## The true effects of b0, b1, b2, b3 in
## y = b0 + b1*x1 + b2*x2 + b3*x3
bslopes <- c(0.2, 15.4, -0.2, 3)

## Now generate the data frame with x1, x2 and x3.
## Let's suppose the predictors are Multivariate Normal, with
## means (100,200,150), standard deviations (10, 20, 30), and
## intercorrelations of 0.4.  These can, of course, be adjusted
## for interesting variations.

## Mind, an "indicator" that says with which cluster an observation belongs.
Mind <- 1:M %x% rep(1,N)

means <- c(100, 200, 150)
sds <- c(10, 20, 30)
rho <- 0.4
corr.mat <- matrix(c(1, rho, rho, rho, 1, rho, rho, rho, 1), nrow = 3)
sigma <- diag(sds) %*% corr.mat %*% diag(sds)
x.mat <- mvrnorm(n = N * M, mu = means, Sigma = sigma)
dimnames(x.mat)[[2]] <- c("x1", "x2", "x3")
## Add an intercept on the front of x.mat
x.mat <- cbind("(Intercept)"= 1, x.mat)

## Create a dependent variable that has no clustering effects.  This
## is an "ordinary" regression with slopes and random error designated
## above random noise
y1 <- x.mat %*% bslopes + rnorm(M*N, m=0, s= STDE)
dat <- data.frame(id=1:(N*M), Mind, y1, as.data.frame(x.mat))
rm(Mind, y1, x.mat) ## cleanup workspace

## Layer on additive group level error in y2
## Add a group-level intercept.
## A vector of M disturbances is created, and each is
## added to all N case within its cluster
dat$y2 <- dat$y1 +  rnorm(M, 0, STDEM) %x% rep(1, N)

## In y3, add in random slope effect
## are changing the slope for x1 from b1 to N(b1, STDEx1^2).
## Same as newb1 ~ b1 + N(0, STDEx1) * x1
dat$y3 <- dat$y2 +   (rnorm(M, 0, STDEx1) %x% rep(1, N)) * dat$x1

## In the lme4 package, there is an "easy" tool to run M separate lm regressions
library(lme4)

##100 separate regressions, with 3 predictors
m3list <- lmList(y3 ~ x1 + x2 + x3 | Mind, data=dat, pool = FALSE)

## Now I see a pattern. I'll use the unique range on x1 for each cluster
## I think that makes the lines look more "in" the data.
plot(y3 ~ x1, data=dat, col=grays[Mind], main = "lm on clusters")
for( i in seq_along(m3list)){
    m3mf <- model.frame(m3list[[i]]) #data set for group i
    x1range <- range(m3mf$x1) ## use group-specific ranges this time
    pgroup <- predict( m3list[[i]], newdata = data.frame(x1=x1range,
x2=mean(m3mf$x2), x3=mean(m3mf$x3)))
    lines(x1range, pgroup, col=grays[i])
}

## Hmm. That data looks like it might be handy for comparison later on.
## Build a data frame of it. I've never thought of doing this work
## cluster-by-cluster before.

m3newdat <- lapply(m3list, function(x) {
    m3mf <- model.frame(x)
    ndf = data.frame(x1=range(m3mf$x1), x2=mean(m3mf$x2), x3=mean(m3mf$x3))
    ndf$m3pred <- predict(x, newdata = ndf)
    ndf} )
## Smash the list of data frames together
m3newdat <- do.call("rbind", m3newdat)
## Interesting!
m3newdat[1:20, ]
## Better add a variable Mind. This looks stupid, but works.
m3newdat$Mind <- as.integer( do.call("rbind",
strsplit(row.names(m3newdat), split="\\.", perl=T))[ ,1] )

## Draw new graphs on a new device, so we can compare
dev.new()

## The "right" model allows the random effects to be correlated.
mm4 <- lmer( y3 ~ x1 + x2 + x3 + (x1 | Mind), data=dat)
summary(mm4)

mm4ranef <- ranef(mm4, postVar = TRUE) ## a ranef.mer object, a list
that includes one data frame

### What do the predicted values mean from lmer?
### here they are. But what do they include?
head(mm4pred <- predict(mm4))
tail(mm4pred <- predict(mm4))

## I can't tell what those numbers are. Can't be sure from ?predict.merMod
## So I'm going to go through the step-by-step
## process of calculating predicted values. These should follow a
formula like this
## for model 4. j is the "cluster" index.
##
##  j'th intercept                j'th slope
##  (b0 + raneff(intercept, j)) + (b1 + raneff(x1,j))*x1 + b2 x2 + b3 x3

mm4b <- coef(summary(mm4))[ ,1]
mm4vnames <- names(mm4b)
mm4mm <- model.matrix(mm4) ## predictors

## Doing this bit by bit. First
## b0 + raneff(intercept, j)
mm4inteffect <- mm4b["(Intercept)"] +  mm4ranef[[1]][dat$Mind, 1]
mm4x1effect <- mm4mm[ , c("x1")] * (mm4b["x1"] + mm4ranef[[1]][dat$Mind, 2])
mm4pred2 <- mm4inteffect + mm4x1effect +  mm4mm[ ,c("x2","x3") ] %*%
mm4b[c("x2","x3")]
head(mm4pred2)
tail(mm4pred2)

## Aha! Those exactly match predict for mm4. So I understand what "predict" does

## Now, I want to run predict for some particular values of x1, x2, x3.

## Let's try running m3newdat through the predict function for mm4.
m3newdat$mm4.3 <- predict(mm4, newdata = m3newdat)
m3newdat

## Disaster. Predicted values completely out of whack. Lots of
## lines fall off the bottom of the graph.

plot(y3 ~ x1, data=dat, col=grays[Mind], main="lmer mixed model predictions")
by(m3newdat, m3newdat$Mind, function(x) { lines(x$x1, x$mm4.3,
col=grays[x$Mind])})

## They don't match the manual calculations I perform here
mm4b <- coef(summary(mm4))[ ,1]
mm4vnames <- names(mm4b)
mm4mm <- as.matrix(m3newdat)

##  b0 + raneff(intercept, j)
mm4inteffect <- mm4b["(Intercept)"] + rep(mm4ranef[[1]][, 1], each=2)
mm4x1effect <- mm4mm[ , c("x1")] * rep(mm4b["x1"] + mm4ranef[[1]][, 2], each=2)
mm4manualpred2 <- mm4inteffect + mm4x1effect +  mm4mm[ ,c("x2","x3") ]
%*%  mm4b[c("x2","x3")]

mm4mm <- cbind(as.data.frame(mm4mm), "mm4manualpred" = mm4manualpred2)
mm4mm

## Here's what I see when I do that. m3pred: the lmList predictions.
## mm4manualpred: my "manual" calculations of predicted values for lmer mm4
## mm4.3 output from predict for same "newdat" used with mm4manualpred.

## > mm4mm
##             x1       x2       x3   m3pred Mind      mm4.3 mm4manualpred
## 1.1   88.51074 196.0615 141.3859 1221.477    1  -434.7602      1239.152
## 1.2  110.88258 196.0615 141.3859 1483.957    1  -630.0463      1466.961
## 2.1   80.26167 194.1252 133.7265 1405.672    2  -614.3256      1388.751
## 2.2  117.08533 194.1252 133.7265 1845.499    2 -1051.4105      1866.505
## 3.1   80.84091 202.4517 154.3976 1302.205    3  3527.8154      1284.706
## 3.2  120.33775 202.4517 154.3976 1692.819    3  5037.6392      1717.210
## 4.1   68.96320 204.5145 151.6662 1321.927    4  5097.5261      1379.251
## 4.2  120.17029 204.5145 151.6662 2137.589    4  8559.2362      2108.628
## 5.1   84.85202 200.7799 153.1595 1821.135    5  4736.2240      1813.415
## 5.2  122.58730 200.7799 153.1595 2418.381    5  6641.2026      2426.855
## 6.1   84.60214 202.3324 153.2583 2015.183    6  1521.6831      2001.700
## 6.2  128.13418 202.3324 153.2583 2778.684    6  2086.4726      2799.053
## 7.1   74.73548 198.5761 155.8984 1780.196    7  1497.6761      1802.381
## 7.2  118.97636 198.5761 155.8984 2622.083    7  2127.8282      2603.626
## 8.1   83.40882 200.4991 152.0515 2342.043    8  1954.4592      2323.477
## 8.2  118.73997 200.4991 152.0515 3088.686    8  2601.6007      3105.182
## 9.1   82.48090 205.4736 152.2774 1979.852    9  2253.5650      1950.388
## 9.2  117.48279 205.4736 152.2774 2554.581    9  3027.9842      2587.319
## 10.1  84.77764 205.9689 152.3325 2256.434   10  2236.3346      2268.620
## 10.2 119.66417 205.9689 152.3325 3022.053   10  2980.1202      3012.406

-- 
Paul E. Johnson
Professor, Political Science    Assoc. Director
1541 Lilac Lane, Room 504     Center for Research Methods
University of Kansas               University of Kansas
http://pj.freefaculty.org            http://quant.ku.edu