[BioC] GOstats, geneCounts and gene universe filtering...

Thu May 10 16:09:32 CEST 2007

Hi,

I'm trying to perform an enrichment analysis for GO terms on my
microarray results. My problem arose when I noticed that
geneCounts(x) doesn't match the number of genes annotated at certain
nodes according to geneIdsByCategory(x). Maybe that's fine; I just
wonder whether it is expected or whether I missed something. I thought
geneCounts was the number of interesting genes (from the list fed to
geneIds) that belong to a particular GO term, and that
geneIdsByCategory should list those genes, i.e. the numbers should
match. This turned out not to be the case for at least two of the GO
nodes in the list of significantly over-represented GO terms:

 > length(geneIdsByCategory(test)[["GO:0051179"]])
[1] 89
 > geneCounts(test)["GO:0051179"]
GO:0051179
         20
 > length(geneIdsByCategory(test)[["GO:0007409"]])
[1] 13
 > geneCounts(test)["GO:0007409"]
GO:0007409
          6

Here test is the output from hyperGTest(params), a conditional test for
over-representation on the rat2302 chip.

As I said, I might have missed something, but it puzzles me somewhat.
Comments welcome :-)

As a "bonus" question, I was wondering if there is any consensus
regarding filtering the gene universe before doing the GO enrichment
analysis. I know it's recommended in the GOstats manual, for instance
by removing probe sets with little variation across samples using IQR
(or some similar measure). But in the topGO package by Adrian Alexa
they seem to care little about this issue and use all GO-annotated
probe sets from the chip used in the particular study. I was
wondering: if you reduce the set of genes in the gene universe (n.GU),
don't you also reduce the number of genes annotated to each GO term
(n.GA) and most likely the number of interesting genes (n.GI)? At
least in my case, some of the genes that were filtered out by IQR were
classified as significantly differentially expressed by cyberT or
limma on the full data set. So what I'm asking here is: don't n.GI
and n.GA depend on, and change as a function of, n.GU? At least with
coarse-grained filtering methods this seems to be the case, and you
might lose some interesting genes and in effect throw out the baby
with the bathwater, so to speak.

Put in (yet) another way: the chance at GO node X of getting n.GI[X]
interesting genes out of all the annotated genes n.GA[X] at that
node, by sampling n.GI genes from n.GU at random, tells you something
about the chance of enrichment at node X. I hope I got that part
right? But if n.GI and n.GA depend on n.GU, this chance of
enrichment might not change drastically when you reduce the gene
universe with some coarse-grained variance method. Or would it?
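
To make the sampling argument concrete, here is a minimal sketch of
the plain (unconditional) hypergeometric tail probability for a single
node, using base R's phyper. The function name nodeEnrichmentP and all
the counts are made up for illustration; they are not from my data,
and the conditional test in GOstats adjusts the per-node counts, so
its p-values will not be reproduced by this.

```r
# Upper-tail hypergeometric probability for one GO node.
# n.GU: genes in the universe
# n.GA: universe genes annotated at the node
# n.GI: interesting genes (all of them, not just at this node)
# k:    interesting genes annotated at the node (n.GI[X] in the text)
nodeEnrichmentP <- function(n.GU, n.GA, n.GI, k) {
    # phyper(q, ...) with lower.tail = FALSE gives P(X > q),
    # so pass k - 1 to obtain P(X >= k)
    phyper(k - 1, n.GA, n.GU - n.GA, n.GI, lower.tail = FALSE)
}

# Hypothetical counts, only to show that filtering changes
# n.GU, n.GA, n.GI and k together rather than n.GU alone
p.unfiltered <- nodeEnrichmentP(n.GU = 10000, n.GA = 200, n.GI = 500, k = 20)
p.filtered   <- nodeEnrichmentP(n.GU = 5000,  n.GA = 120, n.GI = 400, k = 18)
print(c(unfiltered = p.unfiltered, filtered = p.filtered))
```

Whether the two probabilities end up close depends on how the filter
shifts the four counts relative to each other, which is exactly the
part I am unsure about.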

My preliminary test of filtering versus no filtering seems to show
that the effect is rather small: most of the GO terms are
identical in both cases. Does that mean I should put more trust in
the terms that come up in both lists, i.e. with both the filtered and
the unfiltered gene universe? Or should I prefer one list over the
other for some particular reason? It seems to me that the GO terms
that are more robust to changes in the gene universe are the most
likely candidates?
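
The comparison of the two result lists can be sketched with base R
set operations. The helper name compareTermLists is hypothetical, and
the split of GO IDs between the two vectors below is invented purely
for illustration; it is not my actual result.

```r
# Split two vectors of significant GO term IDs into the terms found
# with both universes and the terms found with only one of them.
compareTermLists <- function(termsFiltered, termsUnfiltered) {
    list(shared         = intersect(termsFiltered, termsUnfiltered),
         onlyFiltered   = setdiff(termsFiltered, termsUnfiltered),
         onlyUnfiltered = setdiff(termsUnfiltered, termsFiltered))
}

# Illustrative input only: which term appears in which list is made up
termsFiltered   <- c("GO:0051179", "GO:0007409")
termsUnfiltered <- c("GO:0051179")
cmp <- compareTermLists(termsFiltered, termsUnfiltered)
str(cmp)
```

The shared element would then hold the terms that are robust to the
choice of gene universe.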

Hm, I realise this became a little long. I hope I explained it in a
way that makes sense. Sorry if I raise an already discussed issue,
but I couldn't find any previous discussion of this. Advice and
pointers most appreciated :-)

cheers,
jesper ryge
PhD Student,
Department of Neuroscience
Karolinska Institutet