[BioC] gene set enrichment analysis of RNA-Seq data

Mon May 7 17:20:22 CEST 2012

Dear Gordon,

Thank you very much for your detailed explanations and clarifications.
Very useful.
Thanks,

Paolo

-----Original Message-----
From: Gordon K Smyth [mailto:smyth at wehi.EDU.AU] 
Sent: Saturday, April 28, 2012 22:38
To: Paolo Guarnieri
Cc: Bioconductor mailing list
Subject: gene set enrichment analysis of RNA-Seq data

Dear Paolo,

Well, first of all, let me say that edgeR always takes into account the
library sizes, whether or not you use TMM or calcNormFactors.

The point though is that TMM is not a transformation.  It is specific to
the log-linear modelling approach used in edgeR.  It does not transform
the count data into a form that can be input to permutation-based
statistical methods designed for pathway analysis of microarray data. 
The edgeR User's Guide says

    "These adjustments are offsets in the models used for testing
    DE and do not transform the counts in any way."

Here is the point: the negative binomial approach is very powerful for
genewise or transcriptwise differential expression analysis, but it is
difficult to extend it to gene set analysis.  That is not to say that such
an extension can't be done, but it will be a lot of work.  On the other
hand, limma-voom gives a pipeline that works right now.

You ask about the differences between edgeR and limma-voom for genewise
differential expression of RNA-Seq data.  I don't think the differences
between the two methods are generally as great as you suggest.  The main
difference is that limma-voom is able to adapt to the amount of
gene-specific dispersion heterogeneity in the data, whereas edgeR does not
do this automatically.  In other words, limma estimates df.prior whereas
this is preset (but user changeable) in edgeR.  So, if you have data where
some genes show much greater inter-library variation than others, then
limma will give relatively stronger preference to genes that are
consistent between replicates, whereas edgeR will give relatively stronger
preference to genes with larger fold changes.  These are differences of
degrees only.  edgeR can be tuned to give behavior closer to limma,
although the tuning will be different for each dataset.

edgeR can separate biological from measurement variation, which is useful
for interpretation and which limma-voom can't do.  It is also potentially
more statistically powerful when some of the counts are small.  However
limma-voom is a reliable choice and, after doing many simulations, I feel
that I can recommend it with confidence.  My lab is using both packages in
our day to day work analysing RNA-Seq data for our collaborators.

I will try to clarify the relationship between edgeR and voom more
completely in a paper to be submitted.

Best wishes
Gordon

> Date: Thu, 26 Apr 2012 22:44:05 +0000
> From: Paolo Guarnieri <pg2296 at c2b2.columbia.edu>
> To: <bioconductor at stat.math.ethz.ch>
> Subject: Re: [BioC] gene set enrichment analysis of RNA-Seq data
>
> Dear Gordon,
>
> We (my colleagues and myself) read your post/papers (TMM, limma User 
> Guide, edgeR user Guide and paper and voom vignette) with great 
> interest and we are glad you took the time to address this issue.
>
> We have a couple of additional questions.
> In a previous email you said:
> "However RNA-Seq counts for different libraries can be of very 
> different sizes, and hence will be heteroscedastic.".
> Then the question is: why it is not sufficient to use the TMM 
> normalized data as it takes into account the different library size, 
> but instead you propose to follow the voom transformation procedure?
>
> Additionally we find that differentially expressed genes identified by 
> edgeR are substantially different from those identified by limma after 
> voom transformation. Whereas we expect this behavior, due to the 
> different statistical model and the transformation itself, it is 
> always a reason of concern.
>
> Best,
> Paolo

> Gordon K Smyth <smyth at ...> writes:
>>
>> Dear Julie,
>>
>> A good question.  As far as I know, there is as yet no such method.  
>> What I am doing for this purpose for the time being is to use voom() 
>> in the limma package to transform the RNA-Seq counts to a scale on 
>> which microarray methods can be used, then using roast().  See page 
>> 104 of the limma User's Guide for examples of this:
>>
>> http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/u
>> sersguide.pdf
>>
>> Note that roast() is a self-contained gene set test with the ability 
>> to use linear models and weights:
>>
>>    http://www.ncbi.nlm.nih.gov/pubmed/20610611
>>
>> Another gene set enrichment option that works fine with RNA-Seq data 
>> is camera().  This is a competitive test, but without the usual 
>> disadvantage of gene sampling in that it estimates and adjusts for 
>> inter-gene correlation.  camera() is currently setup to automatically 
>> use the weights that come out of voom(), meaning that camera() 
>> respects the mean-variance relationship of RNA-Seq data.  We have 
>> used it successfully on RNA-Seq data.
>>
>> Best wishes
>> Gordon
>>
>> ------------ original message ------------------ [BioC] gene set 
>> enrichment analysis of RNA-Seq data Julie Leonard julie.leonard at 
>> syngenta.com Thu Apr 12 23:06:54 CEST 2012
>>
>> I was wondering if anyone is aware of a gene set enrichment algorithm 
>> for RNA-Seq data that:
>>
>> 1) does not require a specification of differentially expressed (DE) 
>> genes (i.e.no need to use a hard p-value threshold cutoff for 
>> determining the DE gene
>> list)
>>
>> 2) uses subject sampling instead of gene sampling to obtain the 
>> p-value (i.e.this would maintain gene-gene correlations)
>>
>> Basically, I'm looking for a
>> self-contained/subject sampling method (e.g.
>> SAM-GS for microarray data) or a "hybrid" method (e.g. GSEA for 
>> microarray data).  The only gene set enrichment algorithm that I am 
>> aware of for RNA-Seq data is GOSeq, but it uses a competitive/gene 
>> sampling method (i.e. Fisher's Exact Test).
>> Note, the ideas of self-contained vs competitive and subject sampling 
>> vs gene sampling come from the following paper:  Goeman JJ, Bhlmann 
>> P.Analyzing gene expression data in terms of gene sets:
>> methodological issues. Bioinformatics. 2007 Apr 15;23(8)
>>
>> Something like GSEA-SNP is close to what I want.
>> It uses a test-statistic that is suitable for discrete data and uses 
>> subject sampling to calculate the p-values.
>>
>> Thanks,
>> Julie

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}