[BioC] Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data
Gordon K Smyth
smyth at wehi.EDU.AU
Sun Jun 30 11:17:53 CEST 2013
On Sun, 30 Jun 2013, Tom Wenseleers wrote:
> Dear Gordon,
> Many thanks for your message! Of course, I agree with you that mixed
> model theory as such is pretty complicated and still to some extent in
> development. But what I find easier is the specification part of the
> model, where all you have to do is specify the formula that captures the
> dependencies/sources of correlation in the data and the fixed factors
> plus possible interactions that you want to fit,
You can use the same model formula (for fixed factors) with limma. This is
true even for a two colour design, provided you use lmscFit() instead of
lmFit().
> as opposed to the design matrix in limma, which (at least I sometimes
> find) tricky to specify for more complex multifactorial designs (even
> with functions designMatrix() and contrast.fit()).
The designMatrix() function hasn't existed for nine years. (It was
renamed to modelMatrix.) Are you using an up-to-date version of limma?
> Also, the new R packages lmerTest and lsmeans would make a mixed model
> approach much easier than before, as you can now easily get the desired
> tests and log2 ratio contrasts and lsmeans (plus confidence limits) for
> the different treatment levels out of the lme4 fit (as you could in SAS
> before already for a long time, but alas not in R). I am not saying that
> all this would be trivial, but it could be made quite easy with a
> dedicated package I believe.
>
> The method would indeed analyze two-colour designs as single-colour
> ones, and take into account the pairing of R and G labels on the same
> array using a random factor. The difference though would be that many
> sources of dependency/correlation could readily be taken into account,
> e.g. repeated use of the same sample due to dye swaps and technical
> replication, replicate spots on the same slide, repeated use of samples
> with the same genetic background, etc etc... Maanova does indeed allow
> for this approach to some extent, although I believe lme4 combined with
> lsmeans and lmerTest would be much faster and would allow for more
> complex designs (e.g. with hierarchically nested random factors,
> multiple crossed random factors, random slopes models, etc, the SAS
> example e.g.,
> http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_hpmixed_sect035.htm,
> could not be fit using Maanova I believe, whereas I could do it no probs
> with lme4). And the possible generalization to RNAseq data I think would
> be very interesting. Maybe I'll start doing some simulations to
> systematically compare the statistical power of these different
> approaches for some different experimental designs (many thanks also for
> the Altman paper, in my case I was surprised to find though that I
> seemingly got a better statistical power even with a simple
> single-factor (2 group) 2-colour randomized block design analysed with a
> mixed model, for which the Altman paper claims that the power of the
> different approaches should be comparable).
Is your simple design a direct comparison of A vs B on the same arrays,
without any common reference? If so, then I don't think that there is any
extra information in the data that could be extracted by a mixed model.
So, if the mixed model is giving a lot more DE genes, I would worry that
it is not holding its size correctly. It would be a good idea to do some
simulations to check that the mixed model method is controlling the type I
error rate at the correct level.
Best wishes
Gordon
> Oh yes, and was meaning RGList of course :-)
>
> Well, I'll keep you posted if I make some further progress with this -
> and let me know if you would be interested in co-developing or helping
> out in some way with such a package!
>
> Cheers,
> Tom
>
> -----Original Message-----
> From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU]
> Sent: 30 June 2013 05:42
> To: Tom Wenseleers
> Cc: Bioconductor mailing list
> Subject: Statistical power of limma vs mixed model approach to analyze microarray and differentialproteomics/peptidomics experiments and Poisson/binomial mixed models for RNAseq data
>
> Dear Tom,
>
> I must say that this is first time I have heard anyone suggest that a
> mixed linear model is easier or simpler than an ordinary linear model.
> Having used the lme4 package myself, the opposite seems true to me by a
> very long way.
>
> Do you know that mixed models have been proposed and implemented several
> times for microarray data? For example the maanova package or the
> lmscFit() separate channel approach of the limma package. A recent
> paper by Naomi Altman and myself gives a comparison of some of the
> approaches:
>
> http://www.biomedcentral.com/1471-2105/14/165
>
> For some designs, a mixed model or separate approach can indeed recover
> more information than a classic log-ratio analysis. I don't fully
> follow the approach that you describe, but it sounds to have some
> similarities with the limma separate channel approach as well as with
> maanova.
>
> limma can be generalized to RNA-seq, see
>
> http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
>
> limma does allow for a treatment of randomized block designs but, apart
> from this exception, it is certainly true that RNA-seq analysis methods
> do not allow random effect models. So there is a gap in the market if
> you are keen to fill it.
>
> There isn't any data object class called RGLimma that I know of. Do you
> mean RGList?
>
> Best wishes
> Gordon
>
>> Date: Sat, 29 Jun 2013 02:49:53 +0000
>> From: Tom Wenseleers <Tom.Wenseleers at bio.kuleuven.be>
>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>> Subject: [BioC] Statistical power of limma vs mixed model approach to
>> analyze microarray and differential proteomics/peptidomics experiments
>> and Poisson/binomial mixed models for RNAseq data
>>
>> Dear all,
>
>> I was just analyzing some data from a differential (dual label LC/MS)
>> single factor peptidomics experiment using both a per-feature mixed
>> model approach (using lme4's lmer function, using a model of the form
>> intensity/abundance ~ (1|run)+label+treatment, and testing
>> significance and calculating least square means of the differential
>> expression ratios for the different treatments using lmerTest and
>> lsmeans, using a Satterthwaite df approximation) and an empirical
>> Bayes standard deviation shrinkage method as implemented in limma (by
>> putting my data in an RGLimma object). To my surprise, I am finding
>> more statistically significantly differentially expressed features
>> using the mixed model approach than using limma (independent of the
>> types of normalizations I use in limma, and I did check that the sd vs abundance was flat).
>> Before, I have also observed the same phenomenon in some microarray
>> datasets.
>>
>> I am wondering if this would point towards a mixed model approach
>> having superior statistical power than limma, even for this very simple design.
>> Could this be the case? I only did 8 replicate runs in this case
>> (corresponding to arrays in 2-color microarray exps), so not terribly
>> many... Could it be that limma has superior statistical power for the
>> analysis of relatively small designs with few experimental replicates
>> (where the SD shrinkage could be beneficial), but that for larger
>> number of replicates and also for complex designs (in particular when
>> there are multiple sources of dependencies in the data) a mixed model
>> approach would work better?
>>
>> Also, would a mixed model approach in general not be much easier to
>> specify (just requiring the model formula to be specified, as opposed
>> to all the contrasts, which becomes pretty tedious for complex
>> designs) and statistically be more flexible and powerful for
>> multifactorial experiments (e.g. easily allowing for treatment
>> interaction effects to be tested, and also allowing for multiple crossed random factors, e.g.
>> to take into account several dependencies in the data, e.g. owing to
>> spatial correlations in the microarray slides/print tip effects, other
>> random factors, e.g. due to the use of samples with multiple genetic
>> backgrounds or the presence of repeated measures factors, etc). For
>> repeated measures designs I would also think that mixed model fits
>> obtained using nlme could be even more flexible, e.g. allowing for
>> various custom error covariance structures (e.g. to take into account
>> temporal autocorrelations, etc); in fact, even lmer supports
>> unstructured covariance mode! ls (which would allow variances to be
>> different in your different groups, which I think could be quite
>> common). Model selection, on the other hand, would appear a little
>> more tricky, but also feasible, e.g. by minimizing the AIC of a model
>> that is fit over all features/genes combined (as in
>> http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_hpmixed_sect035.htm#statug.hpmixed.hpmeg4b).
>>
>> Finally, an important advantage that I would see of using this
>> approach is that it would readily generalize to RNAseq data if one
>> would use generalized mixed models (glmer's) with a Poisson error
>> structure (correcting for the total nr of mapped reads, etc and other
>> important
>> covariates) (or, alternatively, using a binomial error structure, to
>> analyze the proportion of reads mapping to a particular gene).
>> Overdispersion in the data in this approach could readily be taken
>> into account by including an observation-level random factor in the
>> model, cf.
>> https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015867.html.
>> The advantage here again would be that complex designs, e.g. with
>> multiple crossed random factors, could readily be taken into account
>> (to my understanding, edgeR or DEseq do not allow for this, right?).
>> Would anyone perhaps have any thoughts why not to go for such a
>> generic approach?
>>
>> Would there perhaps be any market for a package to analyze
>> differential expression in limma RGLimma and ESet data (or
>> CountDataSet or DGEList objects for RNAseq data) using the approach I
>> outline above? If there is enough interest, I would be keen to develop
>> it, as it would seem pretty straightforward to do...
>>
>> Cheers,
>> Tom Wenseleers
>>
>>
>> ______________________________________________________________________
>> _________________
>>
>> Prof. Tom Wenseleers
>> * Lab. of Socioecology and Social Evolution
>> Dept. of Biology
>> Zoological Institute
>> K.U.Leuven
>> Naamsestraat 59, box 2466
>> B-3000 Leuven
>> Belgium
>> * +32 (0)16 32 39 64 / +32 (0)472 40 45 96
>> * tom.wenseleers at bio.kuleuven.be
>> http://bio.kuleuven.be/ento/wenseleers/twenseleers.htm
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list