[R-sig-ME] Mixed model (with interaction) for gene expression and iteration

Thu Jun 4 09:56:12 CEST 2009

Dear Douglas, Rolf, Juan and list,

thank you very much for your replies.
I now got a good working model, and the use of refit and VarCorr will 
definitely help.

I had a go with mcmcsamp(), and I must confirm that this approach is not 
feasible, both computationally and because if you get a "false 
convergence" for, say, 1 gene out of 20, it becomes impossible to go 
back and fix all the errors.

So, the alternative approach seems more promising. If I understand 
correctly, you suggest to calculate a p-value for random effects out of 
the LRT (Likelihood ratio test), and use approximated DFs to calculate 
standard p-values for the fixed effects.

I neved used this approach, so I appreciate if you can point me in the 
right direction.

Random effects:
I'd need to compare the this three model:
m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m1b <- lmer(Y1 ~ sex + (1|line))
m1c <- lmer(Y1 ~ sex)

and get the effect of the interaction from m1a-m1b,
and the effect of "line" from m1b-m1c.
It that correct?
Can anyone point me to any kind of documentation/examples to sort out 
the details?

Fixed effects:
I don't really know where to get the approximated degrees of freedom. 
Can you point me to an example?

Thanks again for all the help. Eventually, I'll be really happy to share 
my experience/code when everything is sorted, even if I doubt I can add 
anything helpful.

Best,
paolo

PS. I don't really understand what you mean by
sobj<-summary(result)
what object is your "result" here?

Juan Pedro Steibel wrote:
> Hello Paolo,
> We are also starting to use lmer for gene expression analysis (genetical 
> genomics too) so here are my thoughts.
> 
> Having two equivalent parameterizations, I would go with the 
> computationally fastest one (a couple more miliseconds per model, easily 
> add up to many hours when analyzing highly dimensional datasets). You 
> can time the analysis for, say, 100 transcripts and go from there.
> 
> ~note: other users commented on your models not being equivalent~
> 
> For p-values:
> 
> You could use LRT to test for variance components.
> 
> This is standard practice in genetic epidemiology: fit a model with and 
> without the random effect in question, then compare the log (residual) 
> likelihood ratio to a chi-square statistic. FDR can be used on top of 
> that to account for multiple tests. Of course, now you have to fit three 
> models (one null models for each VC), so your cpu  time has just 
> multiplied by almost 3. Definitely using refit and update will help with 
> compute time when having so many models.
> 
> I use sobj<-summary(result) as an intermediate step to get the info 
> (although this may add to the computational burden and other suggestions 
> you got may be faster and more efficient),
> 
> then ask for slots:
> @REmat
> @coefs
> ...for getting estimates of variance components and fixed effects.
> 
> @coefs gives you a t-statistic for fixed effects too... you could take a 
> stab at an approximated df method and compute an (approximated) p-value.
> I know that doing so can attract a lot of criticism in this list, but 
> when you are fitting a several million models (10000s of transcripts and 
> 1000s of genomic positions as in my case), the mcmc approximation is 
> (unfortunately) not computationally feasible.
> 
> 
> Hope this helps!
> JP
> 
> 
> 
> Paolo Innocenti wrote:
>> Dear Douglas and list,
>>
>> I am thinking about fitting a mixed model for a microarray experiment 
>> using lme4, since other specific software seems not suitable for my 
>> purposes. I'll briefly describe the model and kindly ask for 
>> suggestions on the model and the workflow I can use to get useful 
>> results.
>>
>> My response variable Y is gene expression levels for a given gene (say 
>> g_i) from 120 samples.
>> The factor I want to include are:
>>
>> Sex: fixed, two levels, M/F.
>> Line: 15 randomly picked genotypes from a large outbred population.
>>
>> I am interested in:
>> - if the gene is differentially expressed in the 2 sexes (effect of 
>> sex), in the 15 lines (effect of line) and the interaction of the two.
>>
>> - the variance component of line = how much of the variance is due to 
>> the genotype
>>
>> - the variance component of the interaction = the genetic variation 
>> for sexual dimorphism.
>>
>> Reading a bit of this mailing list, I came up with these three models:
>>
>> m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>>
>> or
>>
>> m2 <- lmer(Y1 ~ sex + (sex|line))
>>
>> or
>>
>> m3 <- lmer(Y1 ~ sex + (0 + sex|line))
>>
>> Which should all be the same model (and indeed they have all the same 
>> residuals) but different parametrization (see self-contained example 
>> below).
>>
>> Now, in the light of my needs (see above), which model makes it easier 
>> to extract the components I need? Also, do they make different 
>> assumptions - as different levels of independency among levels of 
>> random factors?
>>
>> I will need to be able to extract the variance component values in an 
>> iterative process (i have 18.000 genes): is VarCorr() the way to go?
>>
>> VarCorr(m1)$'sex:line'[1]
>> VarCorr(m1)$'line'[1]
>>
>> Last two question: what is the easier way to assess, in an iterative 
>> process, normality of residuals, and what is a sensible way to assess 
>> significant differential expression of genes (since I guess I can't 
>> get p-values and then apply FDR correction?)
>>
>> Thanks a lot for reading so far and I'll be grateful for any kind of 
>> help.
>> paolo
>>
>> Self-contained example:
>>
>> Y1 <- as.numeric(
>> c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
>> "11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
>> "11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
>> "12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
>> "11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
>> "11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
>> "11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
>> "11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
>> "11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
>> "12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
>> "12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
>> "11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
>> "12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
>> "11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
>> "11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
>> "11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
>> "11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
>> "11.5300"))
>> sex <- factor(rep(c("F","M"), 15, each=4))
>> line <- factor(rep(1:15, each=8))
>> m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>> m2 <- lmer(Y1 ~ sex + (sex|line))
>> m3 <- lmer(Y1 ~ sex + (0 + sex|line))
>> VarCorr(m1)$'sex:line'[1]
>> VarCorr(m1)$'line'[1]
>>
>> Output:
>>
>>>> m1
>>> Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 | 
>>> sex:line)     AIC   BIC logLik deviance REMLdev
>>>  -91.13 -77.2  50.57   -111.1  -101.1
>>> Random effects:
>>>  Groups   Name        Variance  Std.Dev.
>>>  sex:line (Intercept) 0.0023237 0.048205
>>>  line     (Intercept) 0.0169393 0.130151
>>>  Residual             0.0168238 0.129707
>>> Number of obs: 120, groups: sex:line, 30; line, 15
>>>
>>> Fixed effects:
>>>             Estimate Std. Error t value
>>> (Intercept) 11.45977    0.03955  289.72
>>> sexM         0.52992    0.02951   17.96
>>>
>>> Correlation of Fixed Effects:
>>>      (Intr)
>>> sexM -0.373
>>>> m2
>>> Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line) 
>>>  AIC    BIC logLik deviance REMLdev
>>>  -90 -73.27     51   -112.1    -102
>>> Random effects:
>>>  Groups   Name        Variance  Std.Dev. Corr   line     (Intercept) 
>>> 0.0152993 0.123691                 sexM        0.0046474 0.068172 
>>> 0.194  Residual             0.0168238 0.129707       Number of obs: 
>>> 120, groups: line, 15
>>>
>>> Fixed effects:
>>>             Estimate Std. Error t value
>>> (Intercept) 11.45977    0.03606   317.8
>>> sexM         0.52992    0.02951    18.0
>>>
>>> Correlation of Fixed Effects:
>>>      (Intr)
>>> sexM -0.161
>>>> m3
>>> Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line) 
>>>  AIC    BIC logLik deviance REMLdev
>>>  -90 -73.27     51   -112.1    -102
>>> Random effects:
>>>  Groups   Name Variance Std.Dev. Corr   line     sexF 0.015299 
>>> 0.12369                  sexM 0.023227 0.15240  0.899  Residual      
>>> 0.016824 0.12971        Number of obs: 120, groups: line, 15
>>>
>>> Fixed effects:
>>>             Estimate Std. Error t value
>>> (Intercept) 11.45977    0.03606   317.8
>>> sexM         0.52991    0.02951    18.0
>>>
>>> Correlation of Fixed Effects:
>>>      (Intr)
>>> sexM -0.161
>>
>>
>>
> 
> 

-- 
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyvägen 18D
75236 Uppsala, Sweden