[BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?

Gordon K Smyth smyth at wehi.EDU.AU
Fri Jan 24 00:18:14 CET 2014

On Thu, 23 Jan 2014, Jose Garcia wrote:

> Dear Gordon,
> Thanks for the reply. I will try the regression using logCPM values. If I
> have more than one gene to regress to, I guess I could use some design
> formula that could try to regress to all of them, with a the highest
> coefficient. What could be such formula?

If just two or three genes to regress on, then you could possibly put all 
of them in the design matrix.  Otherwise I would consider using the first 
principle component of the log expression values.


> For R pearson I used the Averaged R coefficient and applied a Rank Product
> approach on them.
> With the regression approach, In the end I would finish with a limma-coef
> (<0 anticorrealtion >0 correlation with a p.value) that I could use for
> geneSetTest.
> Thank you very much for the suggestion.
> Best
> Jose
> 2014/1/22 Gordon K Smyth <smyth at wehi.edu.au>
>> Hi Jose,
>> This is a big topic, and you raise a lot of issues.  I'll just try to make
>> some brief responses.
>> On Mon, 20 Jan 2014, Jose Garcia wrote:
>>> Dear Gordon,
>>> First of all, thank you very much for your replies to our questions. For
>> a
>>> non-statistician it is always a pleasure (and a MUST) bearing in mind the
>>> actual statistics behind the approaches, explained in a clear way showing
>>> its advantages, what they do, what they do not/cannot do, caveats and
>>> comparisons.
>>> Let me see if I got the point in my words:
>>> 1. log2FC (shrunken) would be the one to be used for DESeq2 and other
>>> RNA-Seq DE tools for Gene Set Enrichment Analysis pre-ranked but:
>>> 2. intercorrelation of genes will inflate p-values if GSEA (broad) is
>> used
>>> with those log2FC.
>> Yes, a pre-ranked GSEA analysis will inflate significant (make p-values
>> smaller).  You should only do a pre-ranked analysis if you have no
>> alternative, for example if you have no replicates.  If you have no
>> replicates, then as far as I know the only shrunk logFC available to you
>> is that produced by predFC() in the edgeR package.
>> Note that a standard GSEA analysis, as described in the GSEA publications,
>> requires lots of replicates and does not inflate significance.
>>> 3. This issue has been corrected in CAMERA approach but it works only
>>> for normal-limma voom data. (Your next publication in Feb 2014 will show
>>> how limma-voom can be used over DESeq (DESeq2?) and others with success.
>> At the time we wrote the voom paper, CAMERA and ROAST worked with voom but
>> not with edgeR.  CAMERA and ROAST now work with edgeR as well.  For help
>> see
>>    library(edgeR)
>>    ?camera.DGEList
>>> 4. CAMERA can calculate the enrichment for any Gene Set in MSigDB
>>> against the background of other sets, not against all the others, "one
>>> by one", taking account of the inter-correlations.
>> I'm not quite sure what you mean.  CAMERA does take account of
>> inter-correlations, as you say.  It can test against the background of all
>> other sets or against the background of all genes/probes in the genome.
>> By default the background is all genes in the complete data but not in the
>> set.
>>> So, the solutions I see would be:
>>> 1. Use limma/voom/Camera/MSigDB for all the collections of sets I am
>>> interested in  (C2,C6,..) for all the gene sets individually and get
>> their
>>> p-values of enrichment (intercorrelation OK). Should this work in giving
>> me
>>> a list of top Enriched Gene Sets as GSEA(broad) does but with the
>>> correction for intercorrelation? I will explore CAMERA function promptly.
>> Yes.  Note that a regular GSEA analysis (not pre-ranked) also corrects for
>> inter-gene correlations, but requires more replicates than CAMERA or ROAST
>> do.
>>> 2. Use DESeq2/shrunken foldChanges/ find a way to use them into Camera or
>>> Camera-like approach.
>> No, this would not make sense.
>>> In the end, I am not a supporter or neither approach, DESeq2 over
>>> limma-voom or GSEA over Camera. What I would like is to have a
>>> statistically correct approach that takes the DE log2FC with their
>>> p-values and uses lists of relevant genes taking account of the size of
>>> change(~biological relevance) (in the form of a rank list or average
>>> FoldChanges or other) to calculate an enrichment. I chose the pre-ranked
>>> GSEA because of the problem of having to define cutoffs of significance,
>>> specially for fold changes.
>> None of the gene set testing methods require cutoffs.  This is one of
>> things that distinguish them from overlap tests (like an standard GO
>> analysis with Fisher tests).
>>> By the way, I was thinking in using other statistic to explore RNA-Seq
>>> results in a different way to classical DE that might result tricky to
>> use
>>> with Gene Set Enrichment bearing in mind the intercorrelations problem. I
>>> am thinking about finding the correlation Pearson coefficient of each
>> gene
>>> in my dataset for expression values in all samples to values of a set of
>>> genes of my choice (also in my dataset )and find whether most correlated
>>> genes (again without a cut off of R pearson would be better, like ranking
>>> from 1 to 0 or -1) are enriched in a selected pathway or category. What
>>> should I use in this case for gene set enrichment analysis?
>> Well, I have done almost the same analysis that you suggest in the past:
>>    http://www.biomedcentral.com/1471-2164/9/363
>> In that paper, I ranked genes by their correlation with RUNX1 and then
>> tested for enrichment of various sets of genes using the geneSetTest()
>> function of the limma package.  This is similar to a GSEA pre-ranked test
>> but rather seemed to me to be simpler and more convenient.
>> Note that Pearson correlation is not ideal.  A better approach would be to
>> regress all other genes on your gene of interest.  You can do this by
>> adding the log-expression values (logCPM say) of the gene of interest as a
>> column to the design matrix when running voom or edgeR.  We this approach,
>> you can then used the ROAST or CAMERA functions to test for pathways that
>> are correlated with your gene of interest while accounting for inter-gene
>> correlations.
>>> Reading your CAMERA paper discussion, I realise that the
>>> intercorrelation problem is so because using DE data we should
>>> underweight those gene sets with correlation because of your view, in my
>>> opinion correct in most yet not all cases, that inter-gene correlation
>>> 'reflects non-specific co-regulation, unrelated to the treatment
>>> condition'. That may hold true for most DE-driven GSEA tests.
>> In the CAMERA method, we only use the non-specific inter-gene correlation
>> that is not associated with the treatment condition.  So we are making an
>> observation rather than an assumption.
>>> What about the test I last mentioned, where I am really interested in
>>> genes correlating to a set of genes across my whole dataset, regardless
>>> conditions (say all targets of the same miRNA)? And last, what if my set
>>> of genes used to find R pearson were defined based on a classical DE
>>> analysis of the same dataset, say with limma-voom?
>> Sounds like you would be double-dipping.
>> Best wishes
>> Gordon
>>> Thanks again for your help
>>> Best wishes
>>> Jose
>>> 2014/1/19 Gordon K Smyth <smyth at wehi.edu.au>
>>>> Hi Jose,
>>>> Doing a one-sample t-test of the logFCs for a gene set is very similar
>> to
>>>> the test proposed by
>>>>   http://www.ncbi.nlm.nih.gov/pubmed/20048385
>>>> (Actually that paper did a z-test of t-statistics, and criticized the
>> test
>>>> of logFCs as ignoring unequal variances betweeen genes.]
>>>> You're right that this is different from GSEA, and in fact it has been
>>>> criticized directly by the GSEA authors themselves:
>>>>   http://www.ncbi.nlm.nih.gov/pubmed/23070592
>>>> The point I was making in my previous email, is that the criticism made
>> by
>>>> the GSEA authors of the logFC test applies equally to the GSEA
>> pre-ranked
>>>> test.  So I suspect the GSEA authors would not recommend the pre-ranked
>>>> test themselves unless there was no alternative (i.e., no replicates).
>>>> Best wishes
>>>> Gordon
>>>> --------- original message ------------
>>>> [BioC] pre-ranked GSEA within R? + Best DESeq2/limma-voom metric?
>>>> Garcia Manteiga Jose Manuel <garciamanteiga.josemanuel at hsr.it>
>>>> Fri Jan 17 17:52:01 CET 2014
>>>> Dear Mike,
>>>> Thanks for the confirmation, I remember talking to someone during the
>>>> Bioc2013 lab saying that same thing on shrunken log2FC but I do not know
>>>> why I thought that the value to use with pre-ranked was not the log2FC
>> in
>>>> the results object but another one due to the shrinking procedure.
>>>> On the other hand, on the workflow from the lab, I remember that in 5.1
>>>> log2FC was used to test whether particular gene sets had an average
>> log2FC
>>>> different from 0. That would use log2FC information but in a different
>> way
>>>> compared to how I understand GSEA pre-ranked analysis (the procedure
>>>> published by the people in Broad). In the Broad pre-ranked strategy, the
>>>> enrichment using log2FCs from the top (positive and negative) is giving
>>>> you an enrichment score that should be more related, in my view, to
>>>> biological relevance since it is linked to highest log2FCs observed, in
>>>> absolute values, whereas the workflow in 5.1 would flag those gene sets
>>>> that do show a different behaviour but maybe very low just because we
>> are
>>>> mathematically able to say so (because of the t-test used).
>>>> By the way, could the thresholded test implementation in DESeq2 >1.3, or
>>>> something similar, be used in a similar way to filter for those gene
>> sets
>>>> with a more relevant enrichment using the 5.1 workflow?
>>>> Thanks again for the efforts
>>>> Jose
>>>> On 17 Jan 2014, at 16:10, Michael Love <michaelisaiahlove at gmail.com>
>>>> wrote:
>>>> hi Garcia,
>>>> For DESeq2, we do recommend using shrunken log fold changes for gene set
>>>> tests, that is, the log2FoldChange column of a results object when the
>>>> default DESeq2 pipeline has been used.
>>>> For a suggested workflow, see section 5.1 of the PDF for the
>> presentation
>>>> "DESeq2/DEXSeq / Alejandro Reyes, Wolfgang Huber", one of the Bioc2013
>>>> labs: http://bioconductor.org/help/course-materials/2013/BioC2013/
>>>> Mike
>>>> On Fri, Jan 17, 2014 at 3:30 AM, Garcia Manteiga Jose Manuel
>>>> <garciamanteiga.josemanuel at hsr.it> wrote:
>>>> Hi,
>>>> I wanted the same and this is what gsea people replied to me some months
>>>> ago:
>>>> Hi,
>>>> Thank you for your interest in GSEA.
>>>> R GSEA does accept ranked list as an input.
>>>> Please note that R GSEA has not been actively maintained since 2005.
>>>> Regards,
>>>> I have not gone through the R GSEA again to check how pre-ranked lists
>> can
>>>> be accepted, but the last sentence in their message made me hold a
>> minute
>>>> and think in other possibilities.
>>>> For the moment I am still doing as you mentioned, using the rnk file
>> with
>>>> java, then getting back again from results of gsea to R.
>>>> By the way, I would like to known which metric should be best to be used
>>>> for that kind of analysis when using RNA-Seq data coming from DESeq2
>>>> analysis. log2FC? p-values? Are they considered to be weighted, as GSEA
>>>> pre-ranked names them?
>>>> Thanks
>>>> Regards
>>>> On 17 Jan 2014, at 00:18, Daniel Schmolze <bioconductor at schmolze.com>
>>>> wrote:
>>>> I want to do a GSEA entirely from within R, using genes ranked by my
>>>> own metric. At the moment I'm saving my ranked genes to a .rnk file,
>>>> then calling Broad's GSEA java program, then reading the resulting
>>>> output back into R (all I care about are the p-values in
>>>> gsea_report_for_na_pos_####.xls and gsea_report_for_na_neg_####.xls).
>>>> Cumbersome to say the least.
>>>> As far as I can tell, the Broad GSEA R script won't accept pre-ranked
>>>> genes, but maybe I'm wrong? If not, I'm interested in other options.
>>>> I'd like to specifically stick with the Broad GSEA algorithm if
>>>> possible.
>>>> ______________________________________________________________________
>>>> The information in this email is confidential and intended solely for
>> the
>>>> addressee.
>>>> You must not disclose, forward, print or use it without the permission
>> of
>>>> the sender.
>>>> ______________________________________________________________________
>>> --
>>> Jose M. Garcia Manteiga PhD
>>> Research Assistant - Data Analysis in Functional Genomics
>>> Center for Translational Genomics and BioInformatics
>>> Dibit2-Basilica, 3A3
>>> San Raffaele Scientific Institute
>>> Via Olgettina 58, 20132 Milano (MI), Italy
>>> Tel: +39-02-2643-4751
>>> e-mail: garciamanteiga.josemanuel at hsr.it
>> ______________________________________________________________________
>> The information in this email is confidential and intended solely for the
>> addressee.
>> You must not disclose, forward, print or use it without the permission of
>> the sender.
>> ______________________________________________________________________
> -- 
> Jose M. Garcia Manteiga PhD
> Research Assistant - Data Analysis in Functional Genomics
> Center for Translational Genomics and BioInformatics
> Dibit2-Basilica, 3A3
> San Raffaele Scientific Institute
> Via Olgettina 58, 20132 Milano (MI), Italy
> Tel: +39-02-2643-4751
> e-mail: garciamanteiga.josemanuel at hsr.it

The information in this email is confidential and intend...{{dropped:4}}

More information about the Bioconductor mailing list