[BioC] 2 issues about enriched gene sets via Roast

Gordon K Smyth smyth at wehi.EDU.AU
Sat Jan 25 08:12:21 CET 2014


Dear Julie,

When you do a rotation or permutation test, the smallest possible p-value 
that can be achieved depends on the number of distinct rotations or 
permutations that have been performed.  You appear to be using nrot=9999 
rotations, so the smallest one-sided p-value that is possible is

   p = 1 / (nrot+1) = 1e-4

The small possible two-sided p-value possible therefore is twice this, 
which is 2e-4.

Any gene set that contains lots of differential expression, so that the 
observed statistic is greater than any of the rotated statistics, will be 
assigned this minimum p-value.

One can resolve these small p-values further by doing more rotations.  The 
more rotations than are done, the fewer gene sets will sit on the minimum.

Best wishes
Gordon

PS. If you're not sure where the above p-value formula comes from, see 
Section 4 of:

  http://www.statsci.org/smyth/pubs/PermPValuesPreprint.pdf


> Date: Thu, 23 Jan 2014 17:27:46 -0500
> From: <julie.leonard at syngenta.com>
> To: <bioconductor at r-project.org>
> Subject: [BioC] 2 issues about enriched gene sets via Roast
>
> Hi.

> I am using Roast to perform gene set enrichment analysis after doing 
> differential expression analysis in edgeR.  In this particular study, I 
> had 2 variables which I joined to create a single factor in the linear 
> model:  y ~ 0 + combo_variable.  Looking at the variance in the data, 
> most of the variance was due to 1 variable and there was very little 
> variance due to the other variable.  Thus, large numbers of genes 
> (~10,000) were found to be differentially expressed when testing the 
> contrast for one variable and very few genes (~200) were found to be 
> differentially expressed when testing the contrast for the other 
> variable.  When I ran roast for each of these 2 contrasts, the one that 
> had lots of differentially expressed genes found almost all of the gene 
> sets to be enriched.  This is understandable since there were lots of 
> genes differentially expressed, but my problem is that most of the gene 
> sets had the same FDRs.  Thus I can't even narrow down the list of 
> enriched! gene sets by using a more stringent FDR cutoff.  A subset of 
> the output is shown below.  Why would all of these gene sets have the 
> same p-values and thus the same FDRs??
>
> NGenes
>
> PropDown
>
> PropUp
>
> Direction
>
> PValue
>
> FDR
>
> PValue.Mixed
>
> FDR.Mixed
>
> 112
>
> 0.455357
>
> 0.205357
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 38
>
> 0.578947
>
> 0.210526
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 10
>
> 0.2
>
> 0.4
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 311
>
> 0.299035
>
> 0.469453
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 540
>
> 0.233333
>
> 0.344444
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 1294
>
> 0.328439
>
> 0.257342
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 317
>
> 0.29653
>
> 0.533123
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 538
>
> 0.421933
>
> 0.256506
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 133
>
> 0.511278
>
> 0.293233
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 39
>
> 0.589744
>
> 0.205128
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 14
>
> 0.214286
>
> 0.571429
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 13
>
> 0.307692
>
> 0.538462
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 36
>
> 0.472222
>
> 0.222222
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 616
>
> 0.160714
>
> 0.688312
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 6
>
> 1
>
> 0
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 21
>
> 0.428571
>
> 0.285714
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 65
>
> 0.415385
>
> 0.246154
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 99
>
> 0.383838
>
> 0.323232
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 19
>
> 0
>
> 0.578947
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 118
>
> 0.5
>
> 0.313559
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 470
>
> 0.461702
>
> 0.323404
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 1401
>
> 0.404711
>
> 0.250535
>
> Down
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 631
>
> 0.272583
>
> 0.369255
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 55
>
> 0.236364
>
> 0.472727
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
> 5
>
> 0.2
>
> 0.6
>
> Up
>
> 2.00E-04
>
> 0.0002
>
> 1.00E-04
>
> 5.50E-05
>
>
>
> On the other hand, when I ran roast for the contrast with few genes differentially expressed, I got few gene sets enriched.  But what's odd is it did find some gene sets enriched with FDR.Mixed < 0.05, but none of the genes in the gene set were differentially expressed.  Are these enriched gene sets false positives?  I'm not sure what's going on here.
>
> NGenes
>
> PropDown
>
> PropUp
>
> Direction
>
> PValue
>
> FDR
>
> PValue.Mixed
>
> FDR.Mixed
>
> # DE genes
>
> 18
>
> 0.333333
>
> 0
>
> Down
>
> 2.00E-04
>
> 0.01636
>
> 4.00E-04
>
> 0.024994
>
> 0
>
> 5
>
> 0.4
>
> 0
>
> Down
>
> 2.00E-04
>
> 0.01636
>
> 5.00E-04
>
> 0.024994
>
> 0
>
> 7
>
> 0.714286
>
> 0
>
> Down
>
> 6.00E-04
>
> 0.045444
>
> 6.00E-04
>
> 0.024994
>
> 0
>
> 50
>
> 0.08
>
> 0.32
>
> Up
>
> 0.013
>
> 0.159882
>
> 0.001
>
> 0.032379
>
> 0
>
> 49
>
> 0.346939
>
> 0.142857
>
> Down
>
> 0.049
>
> 0.272132
>
> 0.001
>
> 0.032379
>
> 0
>
> 4
>
> 0.25
>
> 0.25
>
> Down
>
> 0.1722
>
> 0.457071
>
> 0.0012
>
> 0.037628
>
> 0
>
>
> Please advise.
>
> Thanks,
> Julie
>
>
> Julie Leonard
> Computational Biologist
> Global Bioinformatics
> Syngenta Biotechnology, Inc.

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list