[R-sig-ME] Mixed model (with interaction) for gene expression and iteration

Fri Jun 5 19:30:50 CEST 2009

Hi all,

thanks again to all of you for the replies.
I took Fabian's suggestion (thanks for the effort you made to make your 
point! amazing) but unfortunately for my purpose it requires to much 
computation time (in terms of model to be fitted and evaluation time). 
At the moment I am happy with

1-(0.5 + 0.5*pchisq(tempobj$Chisq[2], df=1))

for the p-values for the random effects.

For the p-values for fixed effects: yes, i do really need them. They are 
not really important for THE specific gene, but I need to be able to 
say: around 30% of the transcriptome is differentially expressed).

In my case, the effect I am testing is pretty big, my sample size is 
pretty large (120 observations, 60 per group - M/F), completely balanced 
and I can afford to be a bit conservative. That said, I found this link:

http://wiki.r-project.org/rwiki/doku.php?id=guides:lmer-tests

where Doug discuss about using a function hatTrace() and use it to 
calculate the trace and then the degrees of freedom (denDF = n - "trace" 
i think) and feed it to

1 - pf(F.ratio, numDF, ***)

It seems to be what I am looking for, but I can't find the function 
hatTrace...
Does anyone know how I could estimate my denDF? Any suggestion is most 
welcome (JP i have some problem in understanding your lingo... =)
Pinheiro and Bates says:

denDF = m_i - (m_(i-1) + p_i),  i = 1,...,Q+1

but I don't really understand what it means.

Thanks again to everyone!
Best,
paolo

Juan Pedro Steibel wrote:
> Hello paolo,
> sobj<-summary(result), where result is m1a, m1b or m1c.
> 
> Not that m1c is actually a fixed effects model, and can not be fit with 
> lmer, so you'll need to reconstruct the -2LL from another source. But 
> with a normal model that should not be a problem.
> 
> Fabian Scheipl answered in a post that the LRT chisquare(1) was wrong 
> and said that al least a mixture of chisquares should be used. I did not 
> explained myself very well, but I meant to recommend the mixture of 
> chisquare and pointmass when testing a single variance component.  
> Fabian's other method could be used, provided it will not take eons to 
> get the results for a high dimensional dataset.
> 
> But keep reading and you will find a good reason to hammer on me for 
> recomendations on df and computing p-values 8^D
> 
> Probably the whole point for having mcmcsamp was to actually not have to 
> answer that question on ddfs...
> Having said that, my approach in a particular project we are working on 
> is to do ddf=n-p... (n=length(y), p=ncols(x), x= full rank incidence 
> matrix of fixed effects). I'll run and take cover now... 8^D
> Seriously, in our particular design such approximation does not seem to 
> be way off, as I checked it with mcmcsamp for a handful of transcripts.
> 
> And finally, a more existential question:
> Paolo, do you really need a p-value?
> Sometimes in gene expression analysis we only need to rank genes for 
> evidence of differential expression.
> In those cases, you may well rank then using the t-statistic or the LRT 
> (not their p-values), especially if you have the same basic design 
> structure across all genes or transcripts (that you probably do).
> For example if you want to do enrichment of GO terms after the mm 
> analysis you could do a Kolmogorov-Smirnov type of GSE using the 
> t-statistics (or LRTs) to rank genes.
> 
> Hope this makes sense?
> Thanks!
> JP
> 
> 
> 
> Paolo Innocenti wrote:
>> Dear Douglas, Rolf, Juan and list,
>>
>> thank you very much for your replies.
>> I now got a good working model, and the use of refit and VarCorr will 
>> definitely help.
>>
>> I had a go with mcmcsamp(), and I must confirm that this approach is 
>> not feasible, both computationally and because if you get a "false 
>> convergence" for, say, 1 gene out of 20, it becomes impossible to go 
>> back and fix all the errors.
>>
>> So, the alternative approach seems more promising. If I understand 
>> correctly, you suggest to calculate a p-value for random effects out 
>> of the LRT (Likelihood ratio test), and use approximated DFs to 
>> calculate standard p-values for the fixed effects.
>>
>> I neved used this approach, so I appreciate if you can point me in the 
>> right direction.
>>
>> Random effects:
>> I'd need to compare the this three model:
>> m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>> m1b <- lmer(Y1 ~ sex + (1|line))
>> m1c <- lmer(Y1 ~ sex)
>>
>> and get the effect of the interaction from m1a-m1b,
>> and the effect of "line" from m1b-m1c.
>> It that correct?
>> Can anyone point me to any kind of documentation/examples to sort out 
>> the details?
>>
>> Fixed effects:
>> I don't really know where to get the approximated degrees of freedom. 
>> Can you point me to an example?
>>
>> Thanks again for all the help. Eventually, I'll be really happy to 
>> share my experience/code when everything is sorted, even if I doubt I 
>> can add anything helpful.
>>
>> Best,
>> paolo
>>
>> PS. I don't really understand what you mean by
>> sobj<-summary(result)
>> what object is your "result" here?
>>
>>
>>
>>
>> Juan Pedro Steibel wrote:
>>> Hello Paolo,
>>> We are also starting to use lmer for gene expression analysis 
>>> (genetical genomics too) so here are my thoughts.
>>>
>>> Having two equivalent parameterizations, I would go with the 
>>> computationally fastest one (a couple more miliseconds per model, 
>>> easily add up to many hours when analyzing highly dimensional 
>>> datasets). You can time the analysis for, say, 100 transcripts and go 
>>> from there.
>>>
>>> ~note: other users commented on your models not being equivalent~
>>>
>>> For p-values:
>>>
>>> You could use LRT to test for variance components.
>>>
>>> This is standard practice in genetic epidemiology: fit a model with 
>>> and without the random effect in question, then compare the log 
>>> (residual) likelihood ratio to a chi-square statistic. FDR can be 
>>> used on top of that to account for multiple tests. Of course, now you 
>>> have to fit three models (one null models for each VC), so your cpu  
>>> time has just multiplied by almost 3. Definitely using refit and 
>>> update will help with compute time when having so many models.
>>>
>>> I use sobj<-summary(result) as an intermediate step to get the info 
>>> (although this may add to the computational burden and other 
>>> suggestions you got may be faster and more efficient),
>>>
>>> then ask for slots:
>>> @REmat
>>> @coefs
>>> ...for getting estimates of variance components and fixed effects.
>>>
>>> @coefs gives you a t-statistic for fixed effects too... you could 
>>> take a stab at an approximated df method and compute an 
>>> (approximated) p-value.
>>> I know that doing so can attract a lot of criticism in this list, but 
>>> when you are fitting a several million models (10000s of transcripts 
>>> and 1000s of genomic positions as in my case), the mcmc approximation 
>>> is (unfortunately) not computationally feasible.
>>>
>>>
>>> Hope this helps!
>>> JP
>>>
>>>
>>>
>>> Paolo Innocenti wrote:
>>>> Dear Douglas and list,
>>>>
>>>> I am thinking about fitting a mixed model for a microarray 
>>>> experiment using lme4, since other specific software seems not 
>>>> suitable for my purposes. I'll briefly describe the model and kindly 
>>>> ask for suggestions on the model and the workflow I can use to get 
>>>> useful results.
>>>>
>>>> My response variable Y is gene expression levels for a given gene 
>>>> (say g_i) from 120 samples.
>>>> The factor I want to include are:
>>>>
>>>> Sex: fixed, two levels, M/F.
>>>> Line: 15 randomly picked genotypes from a large outbred population.
>>>>
>>>> I am interested in:
>>>> - if the gene is differentially expressed in the 2 sexes (effect of 
>>>> sex), in the 15 lines (effect of line) and the interaction of the two.
>>>>
>>>> - the variance component of line = how much of the variance is due 
>>>> to the genotype
>>>>
>>>> - the variance component of the interaction = the genetic variation 
>>>> for sexual dimorphism.
>>>>
>>>> Reading a bit of this mailing list, I came up with these three models:
>>>>
>>>> m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>>>>
>>>> or
>>>>
>>>> m2 <- lmer(Y1 ~ sex + (sex|line))
>>>>
>>>> or
>>>>
>>>> m3 <- lmer(Y1 ~ sex + (0 + sex|line))
>>>>
>>>> Which should all be the same model (and indeed they have all the 
>>>> same residuals) but different parametrization (see self-contained 
>>>> example below).
>>>>
>>>> Now, in the light of my needs (see above), which model makes it 
>>>> easier to extract the components I need? Also, do they make 
>>>> different assumptions - as different levels of independency among 
>>>> levels of random factors?
>>>>
>>>> I will need to be able to extract the variance component values in 
>>>> an iterative process (i have 18.000 genes): is VarCorr() the way to go?
>>>>
>>>> VarCorr(m1)$'sex:line'[1]
>>>> VarCorr(m1)$'line'[1]
>>>>
>>>> Last two question: what is the easier way to assess, in an iterative 
>>>> process, normality of residuals, and what is a sensible way to 
>>>> assess significant differential expression of genes (since I guess I 
>>>> can't get p-values and then apply FDR correction?)
>>>>
>>>> Thanks a lot for reading so far and I'll be grateful for any kind of 
>>>> help.
>>>> paolo
>>>>
>>>> Self-contained example:
>>>>
>>>> Y1 <- as.numeric(
>>>> c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491", 
>>>>
>>>> "11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
>>>> "11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
>>>> "12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
>>>> "11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
>>>> "11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
>>>> "11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
>>>> "11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
>>>> "11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
>>>> "12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
>>>> "12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
>>>> "11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
>>>> "12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
>>>> "11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
>>>> "11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
>>>> "11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
>>>> "11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
>>>> "11.5300"))
>>>> sex <- factor(rep(c("F","M"), 15, each=4))
>>>> line <- factor(rep(1:15, each=8))
>>>> m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
>>>> m2 <- lmer(Y1 ~ sex + (sex|line))
>>>> m3 <- lmer(Y1 ~ sex + (0 + sex|line))
>>>> VarCorr(m1)$'sex:line'[1]
>>>> VarCorr(m1)$'line'[1]
>>>>
>>>> Output:
>>>>
>>>>>> m1
>>>>> Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 
>>>>> | sex:line)     AIC   BIC logLik deviance REMLdev
>>>>>  -91.13 -77.2  50.57   -111.1  -101.1
>>>>> Random effects:
>>>>>  Groups   Name        Variance  Std.Dev.
>>>>>  sex:line (Intercept) 0.0023237 0.048205
>>>>>  line     (Intercept) 0.0169393 0.130151
>>>>>  Residual             0.0168238 0.129707
>>>>> Number of obs: 120, groups: sex:line, 30; line, 15
>>>>>
>>>>> Fixed effects:
>>>>>             Estimate Std. Error t value
>>>>> (Intercept) 11.45977    0.03955  289.72
>>>>> sexM         0.52992    0.02951   17.96
>>>>>
>>>>> Correlation of Fixed Effects:
>>>>>      (Intr)
>>>>> sexM -0.373
>>>>>> m2
>>>>> Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line) 
>>>>>  AIC    BIC logLik deviance REMLdev
>>>>>  -90 -73.27     51   -112.1    -102
>>>>> Random effects:
>>>>>  Groups   Name        Variance  Std.Dev. Corr   line     
>>>>> (Intercept) 0.0152993 0.123691                 sexM        
>>>>> 0.0046474 0.068172 0.194  Residual             0.0168238 
>>>>> 0.129707       Number of obs: 120, groups: line, 15
>>>>>
>>>>> Fixed effects:
>>>>>             Estimate Std. Error t value
>>>>> (Intercept) 11.45977    0.03606   317.8
>>>>> sexM         0.52992    0.02951    18.0
>>>>>
>>>>> Correlation of Fixed Effects:
>>>>>      (Intr)
>>>>> sexM -0.161
>>>>>> m3
>>>>> Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line) 
>>>>>  AIC    BIC logLik deviance REMLdev
>>>>>  -90 -73.27     51   -112.1    -102
>>>>> Random effects:
>>>>>  Groups   Name Variance Std.Dev. Corr   line     sexF 0.015299 
>>>>> 0.12369                  sexM 0.023227 0.15240  0.899  
>>>>> Residual      0.016824 0.12971        Number of obs: 120, groups: 
>>>>> line, 15
>>>>>
>>>>> Fixed effects:
>>>>>             Estimate Std. Error t value
>>>>> (Intercept) 11.45977    0.03606   317.8
>>>>> sexM         0.52991    0.02951    18.0
>>>>>
>>>>> Correlation of Fixed Effects:
>>>>>      (Intr)
>>>>> sexM -0.161
>>>>
>>>>
>>>>
>>>
>>>
>>
> 
> 

-- 
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyvägen 18D
75236 Uppsala, Sweden