[R-sig-ME] Mixed model (with interaction) for gene expression and iteration

Thu Jun 4 15:22:08 CEST 2009

Hi,

RE:
> For p-values:
>
> You could use LRT to test for variance components.
>
> This is standard practice in genetic epidemiology: fit a model with and without the random effect in question, then compare the log (residual) likelihood ratio to a chi-square  statistic.

This is borderline wrong - using a ChiSq(1) as the null distribution
for testing whether random effect variances are zero will give you a
terribly conservative test- at the least you should use a mixture of a
.5 point mass in zero and .5*ChiSq(1) as the reference distribution
(see Self/Liang (1987)).
More details on this can be found in Crainiceanu/Ruppert(2003)
<http://www.orie.cornell.edu/~davidr/papers/asymptoticpaper2.pdf>.

There is also a fairly fast and almost exact test for this type of
hypothesis implemented in package RLRsim which implements the ideas in
Crainiceanu/Ruppert(2003) and extends them to models with multiple
independent random effects (see our paper
<http://dx.doi.org/10.1016/j.csda.2007.10.022>).

I include some simulation code below to demonstrate why the
conventional (R)LRT is a really bad idea for this and why it might be
better  to consider using RLRsim or, at least, the ChiSq-Mixture
approximation (which will be worse if you have fewer i.i.d. subvectors
in your data, in your case: the less balanced your data is and the
less genotypes you've sampled).

##############
##begin  code ##
##############

#simulates data according to your description
#defaults taken from your reproducible example, except for sd.sexline
simData <- function(
        #no. of obs
        n = 120,
        #no. of genotypes
        n.line = 15,
        #sd of random intercept for genotype
        sd.line = 0.13,
        #sd of random intercpet for genotype:sex interaction
        sd.sexline = 0,
        #sd of errors
        sd.eps = 0.13,
        #fixed effect of sex
        beta.sex = 0.52,
        seed = NA)
{
    if (!is.na(seed)) set.seed(seed)

    sex <- factor(rep(c("M", "F"), times = n.line, each = n/n.line/2))
    line <- factor(rep(1:n.line, each = n/n.line))

    X <- model.matrix(~1 + sex)
    Z <- model.matrix(~0 + line + sex:line)

    Y1 <- X %*% c(0, beta.sex) +
             Z %*% c(rnorm(n.line, sd = sd.line),
                           rnorm(n.line, sd = sd.sexline)) +
             rnorm(n, sd = sd.eps)

    return(data.frame(Y1, line, sex))
}

fitModels <- function(d)
{
    require(lme4)
    m1a <- lmer(Y1 ~ sex + (1 | line) + (1 | sex:line),
            data = d)
    m1b <- lmer(Y1 ~ sex + (1 | line), data = d)
    return(list(sexline = m1a, line = m1b))
}

testSexLine <- function(models)
#test random effect for sex:line
{
    require(RLRsim)

    #exact Restricted Likelihood Ratio test of random effect
(1|sex:line) when random effect (1|line) is present in model
    sexlineOnly <- update(models$sexline, . ~ . - (1 |
                        line))
    t <- exactRLRT(sexlineOnly, models$sexline, models$line)
    p.exact <- t$p.value

    #approximate distribution of RLRT is a mixture of .5 mass in zero
and .5 * ChiSq(1)
    # see Self+Liang(1987)
    p.approx <- 1 - (0.5 + 0.5 * pchisq(t$statistic, df = 1))

    # ChiSq(1) as null distribution for Restricted LRT
    p.naive <- 1 - pchisq(t$statistic, df = 1)

    #return exact and approximate p-values
    return(c(exact = p.exact, chiSqMix = p.approx, naive = p.naive))
}

#simulate data under H0:sd.sexline = 0 and test this hypothesis 500 times
set.seed(123)
simPvals.0 <- replicate(500, testSexLine(fitModels(simData())))
# get rejection probabilities for alpha=.05
(rowMeans(simPvals.0 < 0.05) )
#exact       chiSqMix      naive
#0.050        0.038           0.014
# --> approximations are (much) more conservative, far from nominal level

##############
## end  code ##
##############

So, for a nominal level of 5%, you have a rejection probability of
only 1.4% for the ChiSq(1) and 3.8% for the ChiSqMixture -that's
pretty bad don't you think? - Especially since

>
> FDR can be used on top of that to account for multiple tests.

I'm no expert at all as far as multiple testing is concerned, but I'd
imagine that basing corrections for multiple testing on tests that
don't have the nominal level would tend to mess things up.

Using the exact test or the mixture approximation will give you more
power to detect variance components that are actually present, see for
example simulated data with the estimated parameters from your
example:

##############
##begin  code ##
##############
set.seed(234)
simPvals.048 <- replicate(500,
testSexLine(fitModels(simData(sd.sexline = 0.048))))
rowMeans(simPvals.048 < 0.05)

# exact   chiSqMix  naive
#  0.148   0.132      0.078

##############
## end  code ##
##############

The naive approach is really bad. Admittedly not a huge difference in
power between the exact test and the mixture approximation, but
remember: the mixture approximation will become even more conservative
for unbalanced data or data with fewer genotypes/ levels of line.

HTH,
Fabian