[BioC] GOstats, geneCounts and gene universe filtering...

Thu May 10 17:20:58 CEST 2007

Thanks for the fast answer :-) It's nice to know I'm battling my way in
the right direction...

Here is the session info you requested (I'm using a Mac PowerPC G4 with
Mac OS X 10.4.9, if that's of any help):

 > sessionInfo()
R version 2.5.0 (2007-04-23)
powerpc-apple-darwin8.9.1

locale:
C

attached base packages:
[1] "splines"   "tools"     "stats"     "graphics"  "grDevices" "utils"
[7] "datasets"  "methods"   "base"

other attached packages:
       topGO      SparseM      GOstats     Category       Matrix         KEGG
     "1.2.0"       "0.72"      "2.2.0"      "2.2.2"  "0.9975-11"     "1.16.0"
        RBGL           GO         affy       affyio      rat2302    Rgraphviz
    "1.12.0"     "1.16.0"     "1.14.0"      "1.4.0"     "1.16.0"     "1.14.0"
 geneplotter      lattice        graph       xtable RColorBrewer   genefilter
    "1.14.0"     "0.15-4"     "1.14.0"      "1.4-3"      "0.2-3"     "1.14.1"
    survival     annotate      Biobase
      "2.31"     "1.14.1"     "1.14.0"

On 10 May 2007, at 16:53, Seth Falcon wrote:

> Hi Jesper,
>
> Jesper Ryge <Jesper.Ryge at ki.se> writes:
>> I'm trying to perform an enrichment analysis for GO terms on my
>> microarray results. My problem arose when I noticed that
>> geneCounts(x) doesn't match the number of genes annotated at certain
>> nodes according to geneIdsByCategory(x). Maybe that's OK; I just
>> wondered whether I missed something. I thought geneCounts
>> was the number of interesting genes (from the list fed to geneIds)
>> that belong to a particular GO term and that geneIdsByCategory
>> should list those genes, i.e. the numbers should match. This turned
>> out not to be the case for at least two of the GO nodes in the list of
>> significantly over-represented GO terms:
>>
>>> length(geneIdsByCategory(test)[["GO:0051179"]])
>> [1] 89
>>> geneCounts(test)["GO:0051179"]
>> GO:0051179
>>          20
>>> length(geneIdsByCategory(test)[["GO:0007409"]])
>> [1] 13
>>> geneCounts(test)["GO:0007409"]
>> GO:0007409
>>           6
>> Here, test is the output from hyperGTest(params), a conditional test
>> for over-representation on the rat2302 chip.
>>
>> As I said, I might have missed something, but it puzzles me somewhat.
>> Comments welcome :-)
>
> This doesn't look right to me either.  Can you please send your
> sessionInfo() so I'm certain what versions of things you are using?  I
> suspect there is a bug in how these functions handle the conditional
> case.
>
>> As a "bonus" question, I was wondering if there is any consensus
>> regarding filtering the gene universe before doing the GO enrichment
>> analysis. I know it's recommended in the GOstats manual, for instance
>> by removing probe sets with little variation across samples using IQR
>> (or some similar measure). But in the topGO package by Adrian Alexa
>> they seem to care little about this issue and use all GO-annotated
>> probe sets from the chip used in the particular study.
>
> Perhaps that answers your question: there is not widespread consensus.
>
>> I was wondering: if you reduce the set of genes in the gene universe
>> (n.GU), don't you also reduce the number of genes annotated (n.GA) to
>> each GO term and most likely the number of interesting genes (n.GI)
>
> I think of the filtering process as part of the definition of
> "interesting gene".  So a gene that doesn't pass the non-specific
> filtering is by definition not interesting and doesn't make it into
> the selected gene list.
>
> Yes, non-specific filtering will reduce the set of genes annotated at
> some GO terms, but this is desired IMO.
>
>> - at least in my case, some of the genes that were filtered out by IQR
>> were classified as significantly differentially expressed by cyberT
>> or limma on the full data set. So what I'm asking here is: don't
>> n.GI and n.GA depend on, and change as a function of, n.GU? At least
>> when you use coarse-grained filtering methods, it seems that this is
>> the case, and you might lose some interesting genes and in effect
>> throw out the baby with the bath water, so to speak.
>>
>> Put in (yet) another way: the chance at GO node X of getting n.GI[X]
>> interesting genes out of all the annotated genes n.GA[X] at that
>> node by sampling n.GI genes from n.GU at random tells you something
>> about the chance of enrichment at node X. I hope I got that part
>> right? But if n.GI and n.GA depend on n.GU, this chance of
>> enrichment might not change drastically when you reduce the gene
>> universe with some coarse-grained variance method? Or?
>
> I think you are on the right track.  Filtering should change the
> results, otherwise, why would you filter?  The question at hand is
> whether it is appropriate to include all genes annotated at a given GO
> term when testing that term.  There is consensus (I hope) that genes
> that were not tested in the experiment should be removed.
> Non-specific filtering gives you a chance to remove additional genes
> that were tested, but appear to provide no information about the
> samples.  My experience is that you get more conservative results by
> reducing the gene universe as much as possible.  If you play with
> phyper a bit, I suspect you will come to a similar conclusion.
>
> + seth
>
> -- 
> Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center
> http://bioconductor.org
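
To make the suggestion to "play with phyper a bit" concrete, here is a minimal sketch of how the hypergeometric p-value for a single GO term can react to a smaller gene universe. All counts are made up for illustration and do not come from the rat2302 analysis above; the helper name enrich_p is hypothetical. It assumes the extreme case where filtering removes only genes that are not annotated at the term, and that the selected gene list stays the same.

```r
# Illustrative numbers only; none of them come from the thread.
# Upper-tail hypergeometric p-value P(X >= k) for one GO term:
#   n.GU = genes in the universe, n.GA = genes annotated at the term,
#   n.GI = selected ("interesting") genes, k = selected genes at the term.
enrich_p <- function(k, n.GA, n.GU, n.GI) {
  phyper(k - 1, n.GA, n.GU - n.GA, n.GI, lower.tail = FALSE)
}

# Unfiltered universe: expected overlap is 500 * 200 / 10000 = 10.
p_full <- enrich_p(k = 20, n.GA = 200, n.GU = 10000, n.GI = 500)

# Assumed extreme case: filtering drops 5000 genes, none of them
# annotated at the term, so the expected overlap rises to 20.
p_filtered <- enrich_p(k = 20, n.GA = 200, n.GU = 5000, n.GI = 500)

print(c(full = p_full, filtered = p_filtered))
```

Under these assumptions the filtered universe gives the larger (more conservative) p-value, because the same 20 hits are now close to what chance alone would produce. If filtering also removes genes annotated at the term, or removes selected genes, n.GA, n.GI and k all shift, and the direction of the change depends on the particular numbers.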