[R-sig-ME] Is it right to specify a random slope for the dummy variables

Thu Jan 1 20:44:48 CET 2009

This is an interesting question in that it provides an opportunity for
philosophical musings on fixed- and random-effects and on the way that
categorical covariates are incorporated into model matrices.  Allow me
to expand a bit (well, maybe more than a bit) on Reinhold's remarks
below.

On Wed, Dec 31, 2008 at 5:35 AM, Reinhold Kliegl
<reinhold.kliegl at gmail.com> wrote:
> By default, a categorical variable (factor) with n levels comes with
> n-1 treatment contrasts. Therefore, in the fixed-effect part of the
> model the intercept represents the reference level and the n-1
> contrasts represent the mean differences between the other levels and
> the reference level, assuming a balanced design. You can check your
> specification with contrasts(factor). Of course, you should change
> from treatment to other contrasts as required by your hypotheses. See
> ?contrasts.

> Now suppose you have the variable group as random factor in the model
> and you include the variable factor also in the random effects part:

> lmer(y ~  factor + (factor|group))

> Then, you can estimate the variance of the intercept (i.e., variance
> of reference level for groups), variances of the n-1  difference
> scores for group,  and correlations between intercept and difference
> scores as random effects (i.e., you estimate varying intercepts and
> varying differences and the correlations between them).

> Thus, with categorical variables you are mostly looking at the
> variance and correlation of difference scores between levels of a
> factor rather than variance and correlation of slopes (which are also
> a kind of difference score, of course).

Reinhold is exactly correct in stating that the coefficients
associated with a categorical factor are usually expressed as an
intercept and a set of k-1 "contrasts", where k is the number of
levels in the factor.  There are alternatives, however, and it can be
interesting to explore them.

For brevity let me write f for a factor whose levels are considered
fixed and repeatable, g for a factor whose levels represent a sample
from a, possibly infinite, set of potential levels, x for a continuous
covariate and y for the response.  The number of levels for factor f
is k.

As most R users are (or should be) aware, a model formula, which is
used to generate a model matrix from a data set, includes an implicit
intercept term.  Thus the formula

y ~ x

is equivalent to the formula

y ~ 1 + x

In fact, many people prefer to use the second form because it more
clearly illustrates that there will be two coefficients estimated, the
first associated with a constant term and the second associated with
the numeric values of x.

If we wish to suppress the intercept we can replace the '1' by a '0'
to obtain the formula

y ~ 0 + x

When we have only continuous covariates in the model formula, the
underlying models represented by 1 + x and 0 + x are different but the
parameterization is the same.  That statement may seem opaque but bear
with me and I will try to explain what I mean.  The "underlying model"
is the set of all possible fitted values from the model.  One of the
characteristics of linear models is that the model is defined by the
set of all possible predictions, which must be a linear subspace of
the response space.  For those who know the linear algebra
terminology, this subspace is the column span of the model matrix.
The coefficients are associated with a particular parameterization but
the model itself exists independently of the parameterization.  (It is
this concept that Don Watts and I ported over to nonlinear regression
models to explain some of the effects of nonlinearity.)

So what I was saying about 1 + x versus 0 + x is that the column spans
of the model matrices are different; the first is two dimensional and
the second is one dimensional, but the parameterization for the column
spans is the same.

This is not the case for a categorical factor.  The model formula

y ~ 1 + f

and the model formula

y ~ 0 + f

generate the same model.  Each generates a model matrix whose column
span is the k-dimensional span of the indicator columns for the levels
of f.  For the second formula the parameterization is derived from the
indicator columns.  We can think of the parameterization for the first
model formula as resulting from the constant column followed by the
complete set of indicator columns generating a model matrix with k + 1
columns and rank k.  To obtain a full-rank matrix we must drop some
linear combination of the columns.  The default method is to drop the
indicator of the first level of the factor but any one will do.  That
is, we have a well-defined column span but with an arbitrary
parameterization.

There is an interesting point here regarding the difference between
fixed-effects and random-effects terms.  The coefficients in
fixed-effects terms are estimated by least squares, which I think of
as a rigid criterion.  The model matrix for the fixed-effects terms
must have full column rank.  If it doesn't then the coefficient
estimates are undetermined.  Thus, to obtain a well-defined and unique
set of coefficient estimates we must check the rank of the model
matrix and adjust the form of the matrix (and hence the
parameterization of the model) if we detect a rank-deficient matrix.
Much of the checking and adjustment can be and is done at the symbolic
level, which is why the code for model.matrix is much more subtle than
most would suspect (and also why just about any formula about analysis
of variance given in an introductory book is not really the way that
analysis of variance results are calculated).  The model formula ->
model frame -> model matrix path in the S language is almost never
recognized for the great accomplishment that it is.  However, even the
symbolic analysis doesn't account for all cases, which is why there is
a secondary check on the numerical rank of the model matrix -
something that is not nearly as simple as it sounds.  As anyone who
has taken a linear algebra class knows, the rank of a matrix is a
well-defined property that is easily evaluated.  As anyone who has
taken a numerical linear algebra class knows, the rank of a matrix is
essentially undefined and impossible to evaluate reliably.

One side note; it is exactly the need to assign a rank to a model
matrix and to take corrective action for rank-deficient cases that
keeps us using Linpack code for the QR decomposition instead of the
numerically superior and more efficient Lapack code.

Anyway, back at the fixed-effects versus random-effects computational
issues.  The coefficients for a random-effects term are "estimated" by
penalized least squares, which I think of as a flexible criterion.  (I
would embark on analogies about least squares being the oak and
penalized least squares being the willow in the wind storm but that is
probably a little too much.)  The penalty term takes care of any
problems with rank deficiencies.  We say that it "regularizes" the
calculation in the sense that that the conditional means of the random
effects are "shrunken" toward zero relative to the least squares
estimates.  It is exactly this regularization that allows the models
to be expressed in the natural parameterization.  That is, although
the model matrix for the formula

y ~ 1 + f

cannot be expressed as a constant and the complete set of indicator
columns because it will be rank deficient, the model matrix for the
formula

y ~ 1 + (1 | g)

is expressed as a constant and the complete set of indicator columns
for the levels of g.  Furthermore all of these coefficients are
estimable.

Finally I get to the point of Zhijie's question about incorporating
coefficients for a factor f in the random effects associated with the
levels of g.  Reinhold explained what the formula that could be
written as

1)   y ~ 1 + f + (1 + f | g)

would generate.  This model is equivalent to

2)   y ~ 1 + f + (0 + f | g)

or

3)  y ~ 0 + f + (0 + f | g)

and I would recommend one of these forms for interpretability.  The
parameterization of the fixed-effects is unimportant so either 2) or
3) will do.  The parameterization of the random effects is similarly
unimportant in that formulas 1), 2) and 3) all generate the same model
fits but I feel there is an advantage in using the indicator column
representation from 2) and 3).  All three of these formulas will
result in k variances and k(k-1)/2 covariances (or correlations) being
estimated and up to k random effects for each level of g.  In 2) and
3) the variances reported are the variance of the effect of level i of
factor f on level j on factor g (or vice versa).

The disadvantage of all three formulations is that when k is moderate
the number of variance/covariance parameters being estimated,
k(k+1)/2, can be large.   An alternative model, which I prefer as a
starting point, is

4) y ~ 0 + f + (1 | g) + (1 | f:g)

Model 4) allows for an additive shift for each level of of g and an
additive shift for each (observed) combination of levels of f and g.
There are actually more random effects in 4) than in 1), 2) or 3) but
many fewer variance/covariance parameters.  Model 4) has only two
variance parameters to be estimated and no covariances.   Model 4)
corresponds to model 3) subject to the constraint that the
variance-covariance matrix for the random effects have a compound
symmetry pattern.

Some of these issues are illustrated in the section "Interactions of
grouping factors and other covariates" in the slides at
http://www.stat.wisc.edu/~bates/UseR2008/WorkshopD.pdf (slides 84 to
95).

> On Wed, Dec 31, 2008 at 9:45 AM, zhijie zhang <epistat at gmail.com> wrote:
>> Dear all,
>>  Today, i was thinking the following question.
>> We know the variables may be classified into continuous, ordinal, and
>> categorical variables. I was confused about how to handle with
>> the categorical variables in the multi-level models.
>> For fixed effects,  the categorical variables were always treated as dummy
>> variables, my questions are:
>> 1. Could the random slope be specified for categorical variables that was
>> always changed into the form of  dummy variables?
>> 2. If  the random slope could be specified for categorical variables, how to
>> explain it? It seems a little different from the continuous variables.
>>  I tried the GLIMMIX Procedure in SAS. It seems that SAS treats categorical
>> variables as  continuous variables. While in MLWin, it seems that random
>> slope could be specified for the dummy variables .
>>  Any ideas on it are greatly appreciated.