[BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom

Gordon K Smyth smyth at wehi.EDU.AU
Sat Jan 18 00:53:03 CET 2014

Dear Garcia and Daniel,

If you must do a GSEA pre-ranked analysis, then I think the best statistic 
to use is a shrunken fold change, as it is the best predictor of the size 
of the biological effect for each gene.  (Statistical significance meauses 
something slightly different.)  The predFC() and glmFit() functions of the 
edgeR package have offered "predictive fold changes" for quite a few 
years, and they would still be my choice for this purpose.  voom and 
DESeq2 are more recent alternatives, but I still recommend predFC() with 
prior.count=3 or 5 for this purpose.  And predFC() will work even without 
biological replicates.

However a GSEA pre-ranked analysis does not (and cannot) give correct 
p-values, regardless of the metric used, because it does not take account 
of inter-gene correlations.  This has been shown by a number of published 
papers.  For example, Wu et al (2012) showed that pre-ranked gene set 
analyses give spectactularly wrong p-values even when the inter-gene 
correlation is as low as 0.05.  Have a look at Table 1 of:


The GSEA pre-ranked analysis has similar behaviour to "geneSetTest" in 
this table.  See the remark about GSEA in the last sentence of page 9.

If your data has biological replicates, then more reliable methods are 
readily available.  I don't know whether you want to test one gene set or 
the whole MSigDb collection.  Either way, the mroast() or camera() methods 
of the limma package work seamlessly with either voom or edgeR for RNA-seq 
data, and either will take account inter-gene correlations.  See for 
example the roast() and camera() tests presented in the first case study 


(This paper will appear in Genome Biology next month.)

Best wishes

> Date: Fri, 17 Jan 2014 09:30:47 +0100
> From: Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at hsr.it>
> To: Daniel Schmolze <bioconductor at schmolze.com>
> Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: Re: [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom
> 	metric?
> Hi,
> I wanted the same and this is what gsea people replied to me some months ago:
> Hi,
> Thank you for your interest in GSEA.
> R GSEA does accept ranked list as an input.
> Please note that R GSEA has not been actively maintained since 2005.
> Regards,
> I have not gone through the R GSEA again to check how pre-ranked lists 
> can be accepted, but the last sentence in their message made me hold a 
> minute and think in other possibilities.

> For the moment I am still doing as you mentioned, using the rnk file 
> with java, then getting back again from results of gsea to R.
> By the way, I would like to known which metric should be best to be used 
> for that kind of analysis when using RNA-Seq data coming from DESeq2 
> analysis. log2FC? p-values? Are they considered to be weighted, as GSEA 
> pre-ranked names them?

> Thanks
> Regards
> On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at schmolze.com<mailto:bioconductor at schmolze.com>> wrote:
> I want to do a GSEA entirely from within R, using genes ranked by my
> own metric. At the moment I'm saving my ranked genes to a .rnk file,
> then calling Broad's GSEA java program, then reading the resulting
> output back into R (all I care about are the p-values in
> gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls).
> Cumbersome to say the least.
> As far as I can tell, the Broad GSEA R script won't accept pre-ranked
> genes, but maybe I'm wrong? If not, I'm interested in other options.
> I'd like to specifically stick with the Broad GSEA algorithm if
> possible.

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list