[R-sig-ME] Mixed model (with interaction) for gene expression and iteration

Juan Pedro Steibel steibelj at msu.edu
Thu Jun 4 21:36:33 CEST 2009


Hello Paolo,
Regarding your PS: in sobj <- summary(result), "result" is the fitted 
model object, i.e. m1a, m1b or m1c.

Note that m1c is actually a fixed-effects model and cannot be fit with 
lmer, so you'll need to reconstruct the -2 log-likelihood from another 
source (e.g. lm). With a normal model that should not be a problem.
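For example, something like this should do it (a minimal sketch; it 
assumes you refit m1b by maximum likelihood rather than REML so that 
the two likelihoods are comparable):

m1b <- lmer(Y1 ~ sex + (1|line), REML = FALSE)  # reduced mixed model, ML fit
m1c <- lm(Y1 ~ sex)                             # fixed-effects-only model via lm()
lrt <- as.numeric(2 * (logLik(m1b) - logLik(m1c)))  # LRT statistic for the 'line' component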

Fabian Scheipl pointed out in another post that referring the LRT to a 
chi-square(1) is wrong and that at least a mixture of chi-squares should 
be used. I did not explain myself very well: what I meant to recommend 
was the mixture of a chi-square and a point mass at zero when testing a 
single variance component.
Fabian's other method could also be used, provided it does not take eons 
to get results for a high-dimensional dataset.
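Concretely, for a single variance component the reference distribution 
is a 50:50 mixture of a point mass at zero and a chi-square(1), so 
(using the 'lrt' statistic from the sketch above):

p.line <- 0.5 * pchisq(lrt, df = 1, lower.tail = FALSE)  # 50:50 mixture of chi2(0) and chi2(1)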

But keep reading and you will find a good reason to hammer on me for 
recommendations on df and computing p-values 8^D

Probably the whole point of having mcmcsamp was to avoid having to 
answer that question about denominator degrees of freedom (ddf)...
Having said that, my approach in a particular project we are working on 
is to use ddf = n - p (n = length(y), p = ncol(X), X = the full-rank 
incidence matrix of the fixed effects). I'll run and take cover now... 8^D
Seriously, in our particular design that approximation does not seem to 
be way off, as I checked it against mcmcsamp for a handful of transcripts.
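In code, that rough approximation looks something like this (again a 
sketch; it assumes a version of lme4 where coef(summary(.)) and 
model.matrix(.) return the fixed-effects coefficient table and design 
matrix):

X    <- model.matrix(m1a)                # full-rank fixed-effects incidence matrix
ddf  <- nrow(X) - ncol(X)                # ddf = n - p
tval <- coef(summary(m1a))[, "t value"]  # t statistics for the fixed effects
pval <- 2 * pt(abs(tval), df = ddf, lower.tail = FALSE)  # approximate two-sided p-values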

And finally, a more existential question:
Paolo, do you really need a p-value?
Sometimes in gene expression analysis we only need to rank genes by 
evidence of differential expression.
In those cases, you may well rank them using the t-statistic or the LRT 
(not their p-values), especially if you have the same basic design 
structure across all genes or transcripts (which you probably do).
For example, if you want to do enrichment of GO terms after the 
mixed-model analysis, you could do a Kolmogorov-Smirnov type of gene set 
enrichment using the t-statistics (or LRTs) to rank genes.
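For instance, something as simple as this (a sketch; 'tstat' is a 
hypothetical named vector of per-gene t statistics and 'in.term' a 
hypothetical logical index marking the genes annotated to one GO term):

ks.test(tstat[in.term], tstat[!in.term])  # KS test: genes in the GO term vs. the rest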

Hope this makes sense?
Thanks!
JP



Paolo Innocenti wrote:
> Dear Douglas, Rolf, Juan and list,
>
> thank you very much for your replies.
> I now got a good working model, and the use of refit and VarCorr will 
> definitely help.
>
> I had a go with mcmcsamp(), and I can confirm that this approach is 
> not feasible, both computationally and because if you get a "false 
> convergence" for, say, 1 gene out of 20, it becomes impossible to go 
> back and fix all the errors.
>
> So, the alternative approach seems more promising. If I understand 
> correctly, you suggest calculating a p-value for the random effects 
> from the LRT (likelihood ratio test), and using approximate DFs to 
> calculate standard p-values for the fixed effects.
>
> I never used this approach, so I would appreciate it if you could 
> point me in the right direction.
>
> Random effects:
> I'd need to compare these three models:
> m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
> m1b <- lmer(Y1 ~ sex + (1|line))
> m1c <- lmer(Y1 ~ sex)
>
> and get the effect of the interaction from m1a-m1b,
> and the effect of "line" from m1b-m1c.
> Is that correct?
> Can anyone point me to any kind of documentation/examples to sort out 
> the details?
>
> Fixed effects:
> I don't really know where to get the approximated degrees of freedom. 
> Can you point me to an example?
>
> Thanks again for all the help. Eventually, I'll be really happy to 
> share my experience/code when everything is sorted, even if I doubt I 
> can add anything helpful.
>
> Best,
> paolo
>
> PS. I don't really understand what you mean by
> sobj<-summary(result)
> what object is your "result" here?
>
>
>
>
> Juan Pedro Steibel wrote:
>> Hello Paolo,
>> We are also starting to use lmer for gene expression analysis 
>> (genetical genomics too) so here are my thoughts.
>>
>> Having two equivalent parameterizations, I would go with the 
>> computationally fastest one (a couple of extra milliseconds per model 
>> easily add up to many hours when analyzing high-dimensional 
>> datasets). You can time the analysis for, say, 100 transcripts and go 
>> from there.
>>
>> (Note: other users commented that your models are not actually equivalent.)
>>
>> For p-values:
>>
>> You could use LRT to test for variance components.
>>
>> This is standard practice in genetic epidemiology: fit a model with 
>> and without the random effect in question, then refer twice the 
>> difference in log (residual) likelihoods to a chi-square distribution. 
>> FDR can be used on top of that to account for multiple tests. Of 
>> course, now you have to fit three models (one null model for each VC), 
>> so your CPU time has just multiplied by almost three. Using refit and 
>> update will definitely help with compute time when fitting so many models.
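>>
>> Roughly, something like this (an untested sketch; 'exprs' here is a 
>> hypothetical transcripts-by-samples matrix, and it assumes refit() is 
>> available in your version of lme4):
>>
>> fit.full <- lmer(exprs[1, ] ~ sex + (1|line) + (1|sex:line), REML = FALSE)
>> fit.red  <- lmer(exprs[1, ] ~ sex + (1|line), REML = FALSE)
>> lrt <- sapply(seq_len(nrow(exprs)), function(i) {
>>     f1 <- refit(fit.full, exprs[i, ])  # reuse the fitted structure, new response
>>     f0 <- refit(fit.red,  exprs[i, ])
>>     as.numeric(2 * (logLik(f1) - logLik(f0)))  # LRT for the sex:line component
>> })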
>>
>> I use sobj<-summary(result) as an intermediate step to get the info 
>> (although this may add to the computational burden and other 
>> suggestions you got may be faster and more efficient),
>>
>> then ask for slots:
>> @REmat
>> @coefs
>> ...for getting estimates of variance components and fixed effects.
>>
>> @coefs gives you a t-statistic for the fixed effects too... you could 
>> take a stab at an approximate-df method and compute an (approximate) 
>> p-value.
>> I know that doing so can attract a lot of criticism on this list, but 
>> when you are fitting several million models (10000s of transcripts 
>> and 1000s of genomic positions, as in my case), the MCMC approach 
>> is (unfortunately) not computationally feasible.
>>
>>
>> Hope this helps!
>> JP
>>
>>
>>
>> Paolo Innocenti wrote:
>>> Dear Douglas and list,
>>>
>>> I am thinking about fitting a mixed model for a microarray 
>>> experiment using lme4, since other, more specialized software does 
>>> not seem suitable for my purposes. I'll briefly describe the model 
>>> and kindly ask for suggestions on the model and the workflow I can 
>>> use to get useful results.
>>>
>>> My response variable Y is the gene expression level for a given gene 
>>> (say g_i) from 120 samples.
>>> The factors I want to include are:
>>>
>>> Sex: fixed, two levels, M/F.
>>> Line: 15 randomly picked genotypes from a large outbred population.
>>>
>>> I am interested in:
>>> - whether the gene is differentially expressed between the two sexes 
>>> (effect of sex), among the 15 lines (effect of line), and in the 
>>> interaction of the two.
>>>
>>> - the variance component of line = how much of the variance is due 
>>> to the genotype
>>>
>>> - the variance component of the interaction = the genetic variation 
>>> for sexual dimorphism.
>>>
>>> Reading a bit of this mailing list, I came up with these three models:
>>>
>>> m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>>>
>>> or
>>>
>>> m2 <- lmer(Y1 ~ sex + (sex|line))
>>>
>>> or
>>>
>>> m3 <- lmer(Y1 ~ sex + (0 + sex|line))
>>>
>>> All three should be the same model (and indeed they all have the 
>>> same residuals), just with different parameterizations (see the 
>>> self-contained example below).
>>>
>>> Now, in light of my needs (see above), which model makes it 
>>> easiest to extract the components I need? Also, do they make 
>>> different assumptions, such as different degrees of independence 
>>> among the levels of the random factors?
>>>
>>> I will need to be able to extract the variance component values in 
>>> an iterative process (I have 18,000 genes): is VarCorr() the way to go?
>>>
>>> VarCorr(m1)$'sex:line'[1]
>>> VarCorr(m1)$'line'[1]
>>>
>>> Last two questions: what is the easiest way to assess normality of 
>>> the residuals in an iterative process, and what is a sensible way to 
>>> assess significant differential expression of genes (since I guess I 
>>> can't get p-values and then apply an FDR correction)?
>>>
>>> Thanks a lot for reading so far and I'll be grateful for any kind of 
>>> help.
>>> paolo
>>>
>>> Self-contained example:
>>>
>>> Y1 <- as.numeric(
>>> c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491", 
>>>
>>> "11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
>>> "11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
>>> "12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
>>> "11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
>>> "11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
>>> "11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
>>> "11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
>>> "11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
>>> "12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
>>> "12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
>>> "11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
>>> "12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
>>> "11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
>>> "11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
>>> "11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
>>> "11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
>>> "11.5300"))
>>> sex <- factor(rep(c("F","M"), 15, each=4))
>>> line <- factor(rep(1:15, each=8))
>>> m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>>> m2 <- lmer(Y1 ~ sex + (sex|line))
>>> m3 <- lmer(Y1 ~ sex + (0 + sex|line))
>>> VarCorr(m1)$'sex:line'[1]
>>> VarCorr(m1)$'line'[1]
>>>
>>> Output:
>>>
>>>>> m1
>>>> Linear mixed model fit by REML
>>>> Formula: Y1 ~ sex + (1 | line) + (1 | sex:line)
>>>>     AIC   BIC logLik deviance REMLdev
>>>>  -91.13 -77.2  50.57   -111.1  -101.1
>>>> Random effects:
>>>>  Groups   Name        Variance  Std.Dev.
>>>>  sex:line (Intercept) 0.0023237 0.048205
>>>>  line     (Intercept) 0.0169393 0.130151
>>>>  Residual             0.0168238 0.129707
>>>> Number of obs: 120, groups: sex:line, 30; line, 15
>>>>
>>>> Fixed effects:
>>>>             Estimate Std. Error t value
>>>> (Intercept) 11.45977    0.03955  289.72
>>>> sexM         0.52992    0.02951   17.96
>>>>
>>>> Correlation of Fixed Effects:
>>>>      (Intr)
>>>> sexM -0.373
>>>>> m2
>>>> Linear mixed model fit by REML
>>>> Formula: Y1 ~ sex + (sex | line)
>>>>  AIC    BIC logLik deviance REMLdev
>>>>  -90 -73.27     51   -112.1    -102
>>>> Random effects:
>>>>  Groups   Name        Variance  Std.Dev. Corr
>>>>  line     (Intercept) 0.0152993 0.123691
>>>>           sexM        0.0046474 0.068172 0.194
>>>>  Residual             0.0168238 0.129707
>>>> Number of obs: 120, groups: line, 15
>>>>
>>>> Fixed effects:
>>>>             Estimate Std. Error t value
>>>> (Intercept) 11.45977    0.03606   317.8
>>>> sexM         0.52992    0.02951    18.0
>>>>
>>>> Correlation of Fixed Effects:
>>>>      (Intr)
>>>> sexM -0.161
>>>>> m3
>>>> Linear mixed model fit by REML
>>>> Formula: Y1 ~ sex + (0 + sex | line)
>>>>  AIC    BIC logLik deviance REMLdev
>>>>  -90 -73.27     51   -112.1    -102
>>>> Random effects:
>>>>  Groups   Name Variance Std.Dev. Corr
>>>>  line     sexF 0.015299 0.12369
>>>>           sexM 0.023227 0.15240  0.899
>>>>  Residual      0.016824 0.12971
>>>> Number of obs: 120, groups: line, 15
>>>>
>>>> Fixed effects:
>>>>             Estimate Std. Error t value
>>>> (Intercept) 11.45977    0.03606   317.8
>>>> sexM         0.52991    0.02951    18.0
>>>>
>>>> Correlation of Fixed Effects:
>>>>      (Intr)
>>>> sexM -0.161
>>>
>>>
>>>
>>
>>
>


-- 
=============================
Juan Pedro Steibel

Assistant Professor
Statistical Genetics and Genomics

Department of Animal Science & 
Department of Fisheries and Wildlife

Michigan State University
1205-I Anthony Hall
East Lansing, MI
48824 USA 

Phone: 1-517-353-5102
E-mail: steibelj at msu.edu



