[BioC] Inconsistency in GOstats results after GO enrichment

Tue Mar 4 11:07:27 CET 2008

Hi Tony, James and All,

    This mail is to inform you that this issue is now cleared. As per what I have been conveyed by Robert Gentleman through James the explanation is as follows:

The conditioning has to happen on both the GO terms for the selected genes as well as for the GO terms from the universe. Hence one can get different sizes depending on the input gene IDs. If you re-run the example (provided by James) given below setting conditional=FALSE, you will see that the size values always stay the same.

Thanks to all of you for looking into this and helping me understand the concept and its implications.

James code:
##############################################
set.seed(12345)
set1 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2"))
set.seed(54321)
set2 <- unique(getEG(sample(ls(hgu95av2GO), 100), "hgu95av2"))
univ <- unique(getEG(ls(hgu95av2GO), "hgu95av2"))
p <- new("GOHyperGParams", geneIds=set1, universeGeneIds=univ, 
annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP")
hyp <- hyperGTest(p)
first <- summary(hyp, categorySize=10)
p <- new("GOHyperGParams", geneIds=set2, universeGeneIds=univ, 
annotation="hgu95av2", conditional=TRUE, pvalueCutoff=1, ontology="BP")
hyp <- hyperGTest(p)
second <- summary(hyp, categorySize=10)
ind <- first[,1] %in% second[,1]
data.frame(first[ind,c(1,6)], second[first[ind,1],6])
sum(first[ind,6] != second[first[ind,1],6])

###############################################

Best Regards,
Chanchal 
===============================
Chanchal Kumar, Ph.D. Candidate
Dept. of Proteomics and Signal Transduction
Max Planck Institute of Biochemistry
Am Klopferspitz 18
82152 D-Martinsried (near Munich)
Germany
e-mail: chanchal at biochem.mpg.de
Phone: (Office) +49 (0) 89 8578 2296
Fax:(Office) +49 (0) 89 8578 2219
http://www.biochem.mpg.de/mann/
===============================
________________________________________
From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang
Sent: Tuesday, March 04, 2008 10:57 AM
To: Chanchal Kumar
Cc: James W. MacDonald; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment

Hi Chanchal,

I have not had a chance yet to look over your code. I will do so in the next couple of days and get you a definitive answer if I can figure it out. From the gut though, I suspect that I if remember Seth's implementation correctly, he looks at the test set and universe set with respect to the annotation package. If there is anything in the test or universe has yet to be annotated in the package, these genes are removed before the analysis takes place. I don't really know if *this* is the issue with which you are concerned, but like I said, I will have a look at get back to you asap.

Cheers,
--tc

On Sat, Mar 1, 2008 at 2:53 PM, Chanchal Kumar <chanchal at biochem.mpg.de> wrote:
Hi James, Tony and All,

  Thanks for your inputs and now I have assembled a small dataset which will help you reproduce the inconsistency in GO enrichment output using GOstats package.

To summarize, my concern is when I compare any two "test" datasets against the common "universe" set then I get different "Category Size" output of "universe" for common GO terms across these two "test" datasets. The .txt data files (test1, test2, universe) are attached with the mail and the script is below. The ids are EntrezID for Human and may be redundant in each file. But then I am removing the redundancy before calculating enrichment. Also I am using a p-value cutoff of 1 just to get everything.

After analysis I find discrepancies in these GO terms (there are more):
GO:0016070 ,GO:0051179 ,GO:0006996, GO:0000074 , GO:0031325, GO:0007399, GO:0006464, GO:0030097, GO:0007586, GO:0006810, GO:0007243, GO:0048878

I look forward to your inputs. In rest of the mail sessionInfo() is followed by script.

>  sessionInfo()
R version 2.6.2 (2008-02-08)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252

attached base packages:
[1] splines   tools     stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] hgu95av2.db_2.0.2   GOstats_2.4.0       Category_2.4.0      genefilter_1.16.0   survival_2.34       RBGL_1.14.0
 [7] annotate_1.16.1     xtable_1.5-2        GO.db_2.0.2         AnnotationDbi_1.0.6 RSQLite_0.6-7       DBI_0.2-4
[13] Biobase_1.16.3      graph_1.16.1

loaded via a namespace (and not attached):
[1] cluster_1.11.9

####################SCRIPT STARTS###################### ######################################################
# Load packages
library("GOstats")
library("hgu95av2.db")

# read the universe set
universe<-read.delim(file="universe.txt",sep="\t",header=T, colClasses = "character")
universe<-as.vector(universe[,1])
mode(universe)<-"character"

########## Analysing Test 1 ##################

# read the test set 1
test<-read.delim(file="test1.txt",sep="\t",header=T, colClasses = "character")
test<-as.vector(test[,1])
mode(test)<-"character"

hgCutoff <- 1

#testing for GO enrichment using conditional test

params <- new("GOHyperGParams", geneIds = unique(test),
universeGeneIds = unique(universe), annotation = "hgu95av2",
ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE,
testDirection = "over")

hgOver <- hyperGTest(params)

# Choose only those categories which have at least 3 members in the universe set
d<-summary(hgOver,categorySize =3)
write.table(d,"Test1_GOBP.txt",row.names=F,col.names=T,sep="\t",quote=F)

########## Analysing Test 2 ##################
# read the test set 2
test<-read.delim(file="test2.txt",sep="\t",header=T, colClasses = "character")
test<-as.vector(test[,1])
mode(test)<-"character"

hgCutoff <- 1

#testing for GO enrichment using conditional test

params <- new("GOHyperGParams", geneIds = unique(test),
universeGeneIds = unique(universe), annotation = "hgu95av2",
ontology = "BP", pvalueCutoff = hgCutoff, conditional = TRUE,
testDirection = "over")

hgOver <- hyperGTest(params)

# Choose only those categories which have at least 3 members in the universe set
d<-summary(hgOver,categorySize =3)
write.table(d,"Test2_GOBP.txt",row.names=F,col.names=T,sep="\t",quote=F)

##########################SCRIPT ENDS##########################
###############################################################

Best Regards,
Chanchal
===============================
Chanchal Kumar, Ph.D. Candidate
Dept. of Proteomics and Signal Transduction
Max Planck Institute of Biochemistry
Am Klopferspitz 18
82152 D-Martinsried (near Munich)
Germany
e-mail: chanchal at biochem.mpg.de
Phone: (Office) +49 (0) 89 8578 2296
Fax:(Office) +49 (0) 89 8578 2219
http://www.biochem.mpg.de/mann/
===============================
________________________________________
From: tc.fhcrc at gmail.com [mailto:tc.fhcrc at gmail.com] On Behalf Of Tony Chiang
Sent: Friday, February 29, 2008 3:01 PM
To: James W. MacDonald
Cc: Chanchal Kumar; bioconductor at stat.math.ethz.ch
Subject: Re: [BioC] Inconsistency in GOstats results after GO enrichment
Hi Chanchal,

In addition to the small example, please provide the list with your output of sessionInfo(). This way, when we try to reproduce your example, we will try to use the same version of software packages that you are using.

Cheers,
--Tony

On 2/29/08, James W. MacDonald <jmacdon at med.umich.edu> wrote:
Hi Chanchal,

Unless you can give us a small reproducible example that shows what you
are talking about, there is really nothing anybody can do about this.

Best,

Jim

Chanchal Kumar wrote:
> Dear All,
>
>
>
>    I have been using the "GOstats" package for the GO enrichment on my
> dataset. I have two cluster of genes(test1, test2) and a reference
> set(universe set). After doing enrichment on the two sets for GO BP
> w.r.t to the universe set I get a list of enriched terms. In both cases
> I get "Cell Cycle" (GO:0007049) as enriched. Till there it's fine. But
> strangely I see that the Category Size ( "Size" column of the output )
> gives me different numbers for Cell cycle annotated genes in my universe
> set( in test1 77, in test2 366). Now my concern is that given I use the
> same universe set how I can get different numbers of Cell cycle
> annotated universe genes? This is just one example and I see these
> inconsistencies for many common terms.
>
>
>
> Any insights on this problem will be very helpful.
>
>
>
> Best Regards,
>
> Chanchal
>
> ===============================
>
> Chanchal Kumar, Ph.D. Candidate
>
> Dept. of Proteomics and Signal Transduction
> Max Planck Institute of Biochemistry
> Am Klopferspitz 18
> 82152 D-Martinsried (near Munich)
> Germany
> e-mail: chanchal at biochem.mpg.de
> Phone: (Office) +49 (0) 89 8578 2296
>
> Fax:(Office) +49 (0) 89 8578 2219
> http://www.biochem.mpg.de/mann/
> ===============================
>
>
>
>
>       [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

_______________________________________________
Bioconductor mailing list
Bioconductor at stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor