[BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?

Gordon K Smyth smyth at wehi.EDU.AU
Wed Jan 22 08:15:47 CET 2014

Hi Jose,

This is a big topic, and you raise a lot of issues.  I'll just try to make 
some brief responses.

On Mon, 20 Jan 2014, Jose Garcia wrote:

> Dear Gordon,
> First of all, thank you very much for your replies to our questions. For a
> non-statistician it is always a pleasure (and a MUST) bearing in mind the
> actual statistics behind the approaches, explained in a clear way showing
> its advantages, what they do, what they do not/cannot do, caveats and
> comparisons.
> Let me see if I got the point in my words:
> 1. log2FC (shrunken) would be the one to be used for DESeq2 and other
> RNA-Seq DE tools for Gene Set Enrichment Analysis pre-ranked but:
> 2. intercorrelation of genes will inflate p-values if GSEA (broad) is used
> with those log2FC.

Yes, a pre-ranked GSEA analysis will inflate significant (make p-values 
smaller).  You should only do a pre-ranked analysis if you have no 
alternative, for example if you have no replicates.  If you have no 
replicates, then as far as I know the only shrunk logFC available to you 
is that produced by predFC() in the edgeR package.

Note that a standard GSEA analysis, as described in the GSEA publications, 
requires lots of replicates and does not inflate significance.

> 3. This issue has been corrected in CAMERA approach but it works only 
> for normal-limma voom data. (Your next publication in Feb 2014 will show 
> how limma-voom can be used over DESeq (DESeq2?) and others with success.

At the time we wrote the voom paper, CAMERA and ROAST worked with voom but 
not with edgeR.  CAMERA and ROAST now work with edgeR as well.  For help 


> 4. CAMERA can calculate the enrichment for any Gene Set in MSigDB 
> against the background of other sets, not against all the others, "one 
> by one", taking account of the inter-correlations.

I'm not quite sure what you mean.  CAMERA does take account of 
inter-correlations, as you say.  It can test against the background of all 
other sets or against the background of all genes/probes in the genome. 
By default the background is all genes in the complete data but not in the 

> So, the solutions I see would be:
> 1. Use limma/voom/Camera/MSigDB for all the collections of sets I am
> interested in  (C2,C6,..) for all the gene sets individually and get their
> p-values of enrichment (intercorrelation OK). Should this work in giving me
> a list of top Enriched Gene Sets as GSEA(broad) does but with the
> correction for intercorrelation? I will explore CAMERA function promptly.

Yes.  Note that a regular GSEA analysis (not pre-ranked) also corrects for 
inter-gene correlations, but requires more replicates than CAMERA or ROAST 

> 2. Use DESeq2/shrunken foldChanges/ find a way to use them into Camera or
> Camera-like approach.

No, this would not make sense.

> In the end, I am not a supporter or neither approach, DESeq2 over 
> limma-voom or GSEA over Camera. What I would like is to have a 
> statistically correct approach that takes the DE log2FC with their 
> p-values and uses lists of relevant genes taking account of the size of 
> change(~biological relevance) (in the form of a rank list or average 
> FoldChanges or other) to calculate an enrichment. I chose the pre-ranked 
> GSEA because of the problem of having to define cutoffs of significance, 
> specially for fold changes.

None of the gene set testing methods require cutoffs.  This is one of 
things that distinguish them from overlap tests (like an standard GO 
analysis with Fisher tests).

> By the way, I was thinking in using other statistic to explore RNA-Seq
> results in a different way to classical DE that might result tricky to use
> with Gene Set Enrichment bearing in mind the intercorrelations problem. I
> am thinking about finding the correlation Pearson coefficient of each gene
> in my dataset for expression values in all samples to values of a set of
> genes of my choice (also in my dataset )and find whether most correlated
> genes (again without a cut off of R pearson would be better, like ranking
> from 1 to 0 or -1) are enriched in a selected pathway or category. What
> should I use in this case for gene set enrichment analysis?

Well, I have done almost the same analysis that you suggest in the past:


In that paper, I ranked genes by their correlation with RUNX1 and then 
tested for enrichment of various sets of genes using the geneSetTest() 
function of the limma package.  This is similar to a GSEA pre-ranked test 
but rather seemed to me to be simpler and more convenient.

Note that Pearson correlation is not ideal.  A better approach would be to 
regress all other genes on your gene of interest.  You can do this by 
adding the log-expression values (logCPM say) of the gene of interest as a 
column to the design matrix when running voom or edgeR.  We this approach, 
you can then used the ROAST or CAMERA functions to test for pathways that 
are correlated with your gene of interest while accounting for inter-gene 

> Reading your CAMERA paper discussion, I realise that the 
> intercorrelation problem is so because using DE data we should 
> underweight those gene sets with correlation because of your view, in my 
> opinion correct in most yet not all cases, that inter-gene correlation 
> 'reflects non-specific co-regulation, unrelated to the treatment 
> condition'. That may hold true for most DE-driven GSEA tests.

In the CAMERA method, we only use the non-specific inter-gene correlation 
that is not associated with the treatment condition.  So we are making an 
observation rather than an assumption.

> What about the test I last mentioned, where I am really interested in 
> genes correlating to a set of genes across my whole dataset, regardless 
> conditions (say all targets of the same miRNA)? And last, what if my set 
> of genes used to find R pearson were defined based on a classical DE 
> analysis of the same dataset, say with limma-voom?

Sounds like you would be double-dipping.

Best wishes

> Thanks again for your help
> Best wishes
> Jose
> 2014/1/19 Gordon K Smyth <smyth at wehi.edu.au>
>> Hi Jose,
>> Doing a one-sample t-test of the logFCs for a gene set is very similar to
>> the test proposed by
>>   http://www.ncbi.nlm.nih.gov/pubmed/20048385
>> (Actually that paper did a z-test of t-statistics, and criticized the test
>> of logFCs as ignoring unequal variances betweeen genes.]
>> You're right that this is different from GSEA, and in fact it has been
>> criticized directly by the GSEA authors themselves:
>>   http://www.ncbi.nlm.nih.gov/pubmed/23070592
>> The point I was making in my previous email, is that the criticism made by
>> the GSEA authors of the logFC test applies equally to the GSEA pre-ranked
>> test.  So I suspect the GSEA authors would not recommend the pre-ranked
>> test themselves unless there was no alternative (i.e., no replicates).
>> Best wishes
>> Gordon
>> --------- original message ------------
>> [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?
>> Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at hsr.it>
>> Fri Jan 17 17:52:01 CET 2014
>> Dear Mike,
>> Thanks for the confirmation, I remember talking to someone during the
>> Bioc2013 lab saying that same thing on shrunken log2FC but I do not know
>> why I thought that the value to use with pre-ranked was not the log2FC in
>> the results object but another one due to the shrinking procedure.
>> On the other hand, on the workflow from the lab, I remember that in 5.1
>> log2FC was used to test whether particular gene sets had an average log2FC
>> different from 0. That would use log2FC information but in a different way
>> compared to how I understand GSEA pre-ranked analysis (the procedure
>> published by the people in Broad). In the Broad pre-ranked strategy, the
>> enrichment using log2FCs from the top (positive and negative) is giving
>> you an enrichment score that should be more related, in my view, to
>> biological relevance since it is linked to highest log2FCs observed, in
>> absolute values, whereas the workflow in 5.1 would flag those gene sets
>> that do show a different behaviour but maybe very low just because we are
>> mathematically able to say so (because of the t-test used).
>> By the way, could the thresholded test implementation in DESeq2 >1.3, or
>> something similar, be used in a similar way to filter for those gene sets
>> with a more relevant enrichment using the 5.1 workflow?
>> Thanks again for the efforts
>> Jose
>> On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove at gmail.com>
>> wrote:
>> hi Garcia,
>> For DESeq2, we do recommend using shrunken log fold changes for gene set
>> tests, that is, the log2FoldChange column of a results object when the
>> default DESeq2 pipeline has been used.
>> For a suggested workflow, see section 5.1 of the PDF for the presentation
>> "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013
>> labs: http://bioconductor.org/help/course-materials/2013/BioC2013/
>> Mike
>> On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel
>> <garciamanteiga.josemanuel at hsr.it> wrote:
>> Hi,
>> I wanted the same and this is what gsea people replied to me some months
>> ago:
>> Hi,
>> Thank you for your interest in GSEA.
>> R GSEA does accept ranked list as an input.
>> Please note that R GSEA has not been actively maintained since 2005.
>> Regards,
>> I have not gone through the R GSEA again to check how pre-ranked lists can
>> be accepted, but the last sentence in their message made me hold a minute
>> and think in other possibilities.
>> For the moment I am still doing as you mentioned, using the rnk file with
>> java, then getting back again from results of gsea to R.
>> By the way, I would like to known which metric should be best to be used
>> for that kind of analysis when using RNA-Seq data coming from DESeq2
>> analysis. log2FC? p-values? Are they considered to be weighted, as GSEA
>> pre-ranked names them?
>> Thanks
>> Regards
>> On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at schmolze.com>
>> wrote:
>> I want to do a GSEA entirely from within R, using genes ranked by my
>> own metric. At the moment I'm saving my ranked genes to a .rnk file,
>> then calling Broad's GSEA java program, then reading the resulting
>> output back into R (all I care about are the p-values in
>> gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls).
>> Cumbersome to say the least.
>> As far as I can tell, the Broad GSEA R script won't accept pre-ranked
>> genes, but maybe I'm wrong? If not, I'm interested in other options.
>> I'd like to specifically stick with the Broad GSEA algorithm if
>> possible.
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
> -- 
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list