[BioC] Inconsistency in GOstats results after GO enrichment

Chanchal Kumar chanchal at biochem.mpg.de
Tue Mar 4 19:39:12 CET 2008


Hi Robert,

   That peculiar p-value was for the sake of reproducing my concern which has eventually turned out to be an algorithmic implication due to conditional testing. And at that time was chosen to just get some GO terms which were common but having different category sizes. 

I totally agree with you that such a p-value of 1 or similar will not lead to anything interpretable and in practice I generally use p-val 0.01(or less) cutoff and have got very sensible and interesting results earlier using GOstats. 

On another note please accept my heartiest congratulations on winning the Benjamin Franklin Award 2008. Indeed great news for the Bioconductor and R community!

Best Regards,
Chanchal 
===============================
Chanchal Kumar, Ph.D. Candidate
Dept. of Proteomics and Signal Transduction
Max Planck Institute of Biochemistry
Am Klopferspitz 18
82152 D-Martinsried (near Munich)
Germany
e-mail: chanchal at biochem.mpg.de
Phone: (Office) +49 (0) 89 8578 2296
Fax:(Office) +49 (0) 89 8578 2219
http://www.biochem.mpg.de/mann/
===============================


-----Original Message-----
From: Robert Gentleman [mailto:rgentlem at fhcrc.org] 
Sent: Dienstag, 4. März 2008 16:41
To: Chanchal Kumar
Cc: Tony Chiang; James W. MacDonald; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment

And perhaps to make things a bit clearer:

Chanchal Kumar wrote:
> Hi Tony, James and All,
> 
>     This mail is to inform you that this issue is now cleared. As per what I have been conveyed by Robert Gentleman through James the explanation is as follows:
> 
> The conditioning has to happen on both the GO terms for the selected genes as well as for the GO terms from the universe. Hence one can get different sizes depending on the input gene IDs. If you re-run the example (provided by James) given below setting conditional=FALSE, you will see that the size values always stay the same.
> 
> Thanks to all of you for looking into this and helping me understand the concept and its implications.
> 
> James code:
> ##############################################
> set.seed(12345)
> set1 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2"))
> set.seed(54321)
> set2 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2"))
> univ <- unique(getEG(ls(hgu95av2GO), "hgu95av2"))
> p <- new("GOHyperGParams", geneIds=set1, universeGeneIds=univ, 
> annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP")


  this is a particularly peculiar choice of pvalueCutoff (pointed out by 
Seth Falcon), as it means that everything is significant (since all 
p-values are by definition less than 1), so that the all children are 
considered significant and the genes annotated there removed at each 
step.  I don't think anything interpretable comes from such an analysis.




> hyp <- hyperGTest(p)
> first <- summary(hyp, categorySize=10)
> p <- new("GOHyperGParams", geneIds=set2, universeGeneIds=univ, 
> annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP")
> hyp <- hyperGTest(p)
> second <- summary(hyp, categorySize=10)
> ind <- first[,1] %in% second[,1]
> data.frame(first[ind,c(1,6)], second[first[ind,1],6])
> sum(first[ind,6] != second[first[ind,1],6])
> 
> ###############################################
> 
> Best Regards,
> Chanchal 
> ===============================
> Chanchal Kumar, Ph.D. Candidate
> Dept. of Proteomics and Signal Transduction
> Max Planck Institute of Biochemistry
> Am Klopferspitz 18
> 82152 D-Martinsried (near Munich)
> Germany
> e-mail: chanchal at biochem.mpg.de
> Phone: (Office) +49 (0) 89 8578 2296
> Fax:(Office) +49 (0) 89 8578 2219
> http://www.biochem.mpg.de/mann/
> ===============================
> ________________________________________
> From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang
> Sent: Tuesday, March 04, 2008 10:57 AM
> To: Chanchal Kumar
> Cc: James W. MacDonald; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment
> 
> Hi Chanchal,
> 
> I have not had a chance yet to look over your code. I will do so in the next couple of days and get you a definitive answer if I can figure it out. From the gut though, I suspect that I if remember Seth's implementation correctly, he looks at the test set and universe set with respect to the annotation package. If there is anything in the test or universe has yet to be annotated in the package, these genes are removed before the analysis takes place. I don't really know if *this* is the issue with which you are concerned, but like I said, I will have a look at get back to you asap.
> 
> Cheers,
> --tc
> 
> On Sat, Mar 1, 2008 at 2:53 PM, Chanchal Kumar <chanchal at biochem.mpg.de> wrote:
> Hi James, Tony and All,
> 
>   Thanks for your inputs and now I have assembled a small dataset which will help you reproduce the inconsistency in GO enrichment output using GOstats package.
> 
> To summarize, my concern is when I compare any two "test" datasets against the common "universe" set then I get different "Category Size" output of "universe" for common GO terms across these two "test" datasets. The .txt data files (test1, test2, universe) are attached with the mail and the script is below. The ids are EntrezID for Human and may be redundant in each file. But then I am removing the redundancy before calculating enrichment. Also I am using a p-value cutoff of 1 just to get everything.
> 
> After analysis I find discrepancies in these GO terms (there are more):
> GO:0016070 ,GO:0051179 ,GO:0006996, GO:0000074 , GO:0031325, GO:0007399, GO:0006464, GO:0030097, GO:0007586, GO:0006810, GO:0007243, GO:0048878
> 
> I look forward to your inputs. In rest of the mail sessionInfo() is followed by script.
> 
>>  sessionInfo()
> R version 2.6.2 (2008-02-08)
> i386-pc-mingw32
> 
> locale:
> LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252
> 
> attached base packages:
> [1] splines   tools     stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
>  [1] hgu95av2.db_2.0.2   GOstats_2.4.0       Category_2.4.0      genefilter_1.16.0   survival_2.34       RBGL_1.14.0
>  [7] annotate_1.16.1     xtable_1.5-2        GO.db_2.0.2         AnnotationDbi_1.0.6 RSQLite_0.6-7       DBI_0.2-4
> [13] Biobase_1.16.3      graph_1.16.1
> 
> loaded via a namespace (and not attached):
> [1] cluster_1.11.9
> 
> ####################SCRIPT STARTS###################### ######################################################
> # Load packages
> library("GOstats")
> library("hgu95av2.db")
> 
> 
> # read the universe set
> universe<-read.delim(file="universe.txt",sep="\t",header=T, colClasses = "character")
> universe<-as.vector(universe[,1])
> mode(universe)<-"character"
> 
> ########## Analysing Test 1 ##################
> 
> # read the test set 1
> test<-read.delim(file="test1.txt",sep="\t",header=T, colClasses = "character")
> test<-as.vector(test[,1])
> mode(test)<-"character"
> 
> hgCutoff <- 1
> 
> #testing for GO enrichment using conditional test
> 
> params <- new("GOHyperGParams", geneIds = unique(test),
> universeGeneIds = unique(universe), annotation = "hgu95av2",
> ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE,
> testDirection = "over")
> 
> hgOver <- hyperGTest(params)
> 
> # Choose only those categories which have at least 3 members in the universe set
> d<-summary(hgOver,categorySize =3)
> write.table(d,"Test1_GOBP.txt",row.names=F,col.names=T,sep="\t",quote=F)
> 
> 
> ########## Analysing Test 2 ##################
> # read the test set 2
> test<-read.delim(file="test2.txt",sep="\t",header=T, colClasses = "character")
> test<-as.vector(test[,1])
> mode(test)<-"character"
> 
> hgCutoff <- 1
> 
> #testing for GO enrichment using conditional test
> 
> params <- new("GOHyperGParams", geneIds = unique(test),
> universeGeneIds = unique(universe), annotation = "hgu95av2",
> ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE,
> testDirection = "over")
> 
> hgOver <- hyperGTest(params)
> 
> # Choose only those categories which have at least 3 members in the universe set
> d<-summary(hgOver,categorySize =3)
> write.table(d,"Test2_GOBP.txt",row.names=F,col.names=T,sep="\t",quote=F)
> 
> ##########################SCRIPT ENDS##########################
> ###############################################################
> 
> 
> Best Regards,
> Chanchal
> ===============================
> Chanchal Kumar, Ph.D. Candidate
> Dept. of Proteomics and Signal Transduction
> Max Planck Institute of Biochemistry
> Am Klopferspitz 18
> 82152 D-Martinsried (near Munich)
> Germany
> e-mail: chanchal at biochem.mpg.de
> Phone: (Office) +49 (0) 89 8578 2296
> Fax:(Office) +49 (0) 89 8578 2219
> http://www.biochem.mpg.de/mann/
> ===============================
> ________________________________________
> From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang
> Sent: Friday, February 29, 2008 3:01 PM
> To: James W. MacDonald
> Cc: Chanchal Kumar; bioconductor at stat.math.ethz.ch
> Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment
> Hi Chanchal,
>  
> In addition to the small example, please provide the list with your output of sessionInfo(). This way, when we try to reproduce your example, we will try to use the same version of software packages that you are using.
>  
> Cheers,
> --Tony
> 
>  
> On 2/29/08, James W. MacDonald <jmacdon at med.umich.edu> wrote:
> Hi Chanchal,
> 
> Unless you can give us a small reproducible example that shows what you
> are talking about, there is really nothing anybody can do about this.
> 
> Best,
> 
> Jim
> 
> 
> 
> Chanchal Kumar wrote:
>> Dear All,
>>
>>
>>
>>     I have been using the "GOstats" package for the GO enrichment on my
>> dataset. I have two cluster of genes(test1, test2) and a reference
>> set(universe set). After doing enrichment on the two sets for GO BP
>> w.r.t to the universe set I get a list of enriched terms. In both cases
>> I get "Cell Cycle" (GO:0007049) as enriched. Till there it's fine. But
>> strangely I see that the Category Size ( "Size" column of the output )
>> gives me different numbers for Cell cycle annotated genes in my universe
>> set( in test1 77, in test2 366). Now my concern is that given I use the
>> same universe set how I can get different numbers of Cell cycle
>> annotated universe genes? This is just one example and I see these
>> inconsistencies for many common terms.
>>
>>
>>
>> Any insights on this problem will be very helpful.
>>
>>
>>
>> Best Regards,
>>
>> Chanchal
>>
>> ===============================
>>
>> Chanchal Kumar, Ph.D. Candidate
>>
>> Dept. of Proteomics and Signal Transduction
>> Max Planck Institute of Biochemistry
>> Am Klopferspitz 18
>> 82152 D-Martinsried (near Munich)
>> Germany
>> e-mail: chanchal at biochem.mpg.de
>> Phone: (Office) +49 (0) 89 8578 2296
>>
>> Fax:(Office) +49 (0) 89 8578 2219
>> http://www.biochem.mpg.de/mann/
>> ===============================
>>
>>
>>
>>
>>        [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> --
> James W. MacDonald, M.S.
> Biostatistician
> Affymetrix and cDNA Microarray Core
> University of Michigan Cancer Center
> 1500 E. Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 

-- 
Robert Gentleman, PhD
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
PO Box 19024
Seattle, Washington 98109-1024
206-667-7700
rgentlem at fhcrc.org



More information about the Bioconductor mailing list