[BioC] gene set enrichment analysis of RNA-Seq data

Fri Apr 27 00:44:05 CEST 2012

Gordon K Smyth <smyth at ...> writes:

> 
> Dear Julie,
> 
> A good question.  As far as I know, there is as yet no such method.  What 
> I am doing for this purpose for the time being is to use voom() in the 
> limma package to transform the RNA-Seq counts to a scale on which 
> microarray methods can be used, then using roast().  See page 104 of the 
> limma User's Guide for examples of this:
> 
> http://bioconductor.org/packages/2.10/bioc/vignettes/limma/inst/doc/usersguide.pdf
> 
> Note that roast() is a self-contained gene set test with the ability to 
> use linear models and weights:
> 
>    http://www.ncbi.nlm.nih.gov/pubmed/20610611
> 
> Another gene set enrichment option that works fine with RNA-Seq data is 
> camera().  This is a competitive test, but without the usual disadvantage 
> of gene sampling in that it estimates and adjusts for inter-gene 
> correlation.  camera() is currently set up to automatically use the weights 
> that come out of voom(), meaning that camera() respects the mean-variance 
> relationship of RNA-Seq data.  We have used it successfully on RNA-Seq 
> data.
> 
> Best wishes
> Gordon
> 
> ------------ original message ------------------
> [BioC] gene set enrichment analysis of RNA-Seq data
> Julie Leonard julie.leonard at syngenta.com
> Thu Apr 12 23:06:54 CEST 2012
> 
> I was wondering if anyone is aware of a gene
> set enrichment algorithm for RNA-Seq data that:
> 
> 1) does not require a specification of differentially
> expressed (DE) genes (i.e. no need to use a hard
> p-value threshold cutoff for determining the DE gene
> list)
> 
> 2) uses subject sampling instead of gene sampling
> to obtain the p-value (i.e. this would maintain
> gene-gene correlations)
> 
> Basically, I'm looking for a
> self-contained/subject sampling method (e.g.
> SAM-GS for microarray data) or a "hybrid" method
> (e.g. GSEA for microarray data).  The only gene set
> enrichment algorithm that I am aware of for RNA-Seq
> data is GOSeq, but it uses a competitive/gene
> sampling method (i.e. Fisher's Exact Test).
> Note, the ideas of self-contained vs competitive and
> subject sampling vs gene sampling come from the
> following paper:  Goeman JJ, Bühlmann P. Analyzing
> gene expression data in terms of gene sets:
> methodological issues. Bioinformatics. 2007 Apr 15;23(8)
> 
> Something like GSEA-SNP is close to what I want.
> It uses a test-statistic that is suitable for discrete data
> and uses subject sampling to calculate the p-values.
> 
> Thanks,
> Julie
> 

Dear Gordon,

My colleagues and I read your posts and papers (TMM, the limma User's Guide,
the edgeR User's Guide and paper, and the voom vignette) with great interest,
and we are glad you took the time to address this issue.

We have a couple of additional questions.
In a previous email you said:
"However RNA-Seq counts for different libraries
can be of very different sizes, and hence will be heteroscedastic."
Our first question is: why is it not sufficient to use the TMM-normalized
data, since it already takes the different library sizes into account,
rather than following the voom transformation procedure that you propose?
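
To make the quoted statement concrete, here is a minimal Python sketch of what
"heteroscedastic" means in this setting. It is only an illustration under
assumptions that are not part of the thread: the counts are treated as pure
Poisson (no biological variation), the gene, its abundance and the two library
sizes are made up, and the standard deviation comes from a delta-method
approximation. It is not how voom() itself estimates the mean-variance trend.

```python
import math

def log2_cpm(count, lib_size):
    """log2 counts-per-million, with offsets of 0.5 on the count and 1 on the
    library size (the log-cpm form used by voom)."""
    return math.log2((count + 0.5) / (lib_size + 1.0) * 1e6)

def approx_sd_log2_count(expected_count):
    """Delta-method SD of log2(count) for a Poisson count (ASSUMPTION:
    Poisson noise only): Var(log X) ~ 1/mean, so SD(log2 X) ~ 1/(ln2*sqrt(mean))."""
    return 1.0 / (math.log(2) * math.sqrt(expected_count))

# Hypothetical gene expressed at 10 counts per million in two libraries
# that differ 20-fold in sequencing depth.
cpm = 10.0
results = {}
for lib_size in (1_000_000, 20_000_000):
    expected = cpm * lib_size / 1e6          # expected raw count in this library
    results[lib_size] = {
        "expected_count": expected,
        "log2_cpm": log2_cpm(expected, lib_size),
        "sd": approx_sd_log2_count(expected),
    }

for lib_size, r in results.items():
    print(f"lib size {lib_size:>10,}: expected count {r['expected_count']:6.0f}, "
          f"log2-cpm {r['log2_cpm']:.3f}, approx SD {r['sd']:.3f}")
```

In this sketch, scaling by library size puts the two libraries on nearly the
same log2-cpm value, but the value from the smaller library is much noisier.
Normalization changes the scale of the observations, not their precision.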

Additionally, we find that the differentially expressed genes identified by
edgeR are substantially different from those identified by limma after
voom transformation.
Although we expect some difference, given the different statistical
models and the transformation itself,
it is still a cause for concern.

Best,

Paolo