[BioC] GOstats, geneCounts and gene universe filtering...
Jesper Ryge
Jesper.Ryge at ki.se
Thu May 10 17:20:58 CEST 2007
Thanks for the fast answer :-) It's nice to know I'm battling my way in
the right direction...
Here is the session info you requested (I'm using a Mac PowerPC G4 with
Mac OS X 10.4.9, if that's of any help):
> sessionInfo()
R version 2.5.0 (2007-04-23)
powerpc-apple-darwin8.9.1
locale:
C
attached base packages:
[1] "splines" "tools" "stats" "graphics" "grDevices" "utils"
[7] "datasets" "methods" "base"
other attached packages:
       topGO      SparseM      GOstats     Category       Matrix         KEGG
     "1.2.0"       "0.72"      "2.2.0"      "2.2.2"  "0.9975-11"     "1.16.0"
        RBGL           GO         affy       affyio      rat2302    Rgraphviz
    "1.12.0"     "1.16.0"     "1.14.0"      "1.4.0"     "1.16.0"     "1.14.0"
 geneplotter      lattice        graph       xtable RColorBrewer   genefilter
    "1.14.0"     "0.15-4"     "1.14.0"      "1.4-3"      "0.2-3"     "1.14.1"
    survival     annotate      Biobase
      "2.31"     "1.14.1"     "1.14.0"
On 10 May 2007, at 16:53, Seth Falcon wrote:
> Hi Jesper,
>
> Jesper Ryge <Jesper.Ryge at ki.se> writes:
>> I'm trying to perform an enrichment analysis for GO terms on my
>> microarray results. My problem arose when I noticed that
>> geneCounts(x) doesn't match the number of genes annotated at certain
>> nodes according to geneIdsByCategory(x). Maybe that's OK; I just
>> wondered whether I missed something. I thought geneCounts
>> was the number of interesting genes (from the list fed to geneIds)
>> that belong to a particular GO term, and that geneIdsByCategory
>> should list those genes, i.e. the numbers should match. This turned
>> out not to be the case for at least two of the GO nodes in the list of
>> significantly over-represented GO terms:
>>
>> > length(geneIdsByCategory(test)[["GO:0051179"]])
>> [1] 89
>> > geneCounts(test)["GO:0051179"]
>> GO:0051179
>>         20
>> > length(geneIdsByCategory(test)[["GO:0007409"]])
>> [1] 13
>> > geneCounts(test)["GO:0007409"]
>> GO:0007409
>>          6
>> test is the output from hyperGTest(params), a conditional test for
>> over-representation on the rat2302 chip.
>>
>> As I said, I might have missed something, but it puzzles me somewhat.
>> Comments welcome :-)
>
> This doesn't look right to me either. Can you please send your
> sessionInfo() so I'm certain what versions of things you are using? I
> suspect there is a bug in how these functions handle the conditional
> case.
>
>> As a "bonus" question, I was wondering if there is any consensus
>> regarding filtering the gene universe before doing the GO enrichment
>> analysis. I know it's recommended in the GOstats manual, for instance
>> by removing probe sets with little variation across samples using IQR
>> (or some similar measure). But the topGO package by Adrian Alexa
>> seems to care little about this issue and uses all GO-annotated
>> probe sets from the chip used in the particular study.
>
> Perhaps that answers your question: there is not widespread consensus.
>
>> I was wondering: if you reduce the set of genes in the gene universe
>> (n.GU), don't you also reduce the number of genes annotated (n.GA) to
>> each GO term, and most likely the number of interesting genes (n.GI)
>
> I think of the filtering process as part of the definition of
> "interesting gene". So a gene that doesn't pass the non-specific
> filtering is by definition not interesting and doesn't make it into
> the selected gene list.
>
> Yes, non-specific filtering will reduce the set of genes annotated at
> some GO terms, but this is desired IMO.
>
>> - at least in my case, some of the genes that are filtered out by IQR
>> were classified as significantly differentially expressed by cyberT
>> or limma on the full data set. So what I'm asking here is: don't
>> n.GI and n.GA depend on, and change as a function of, n.GU? At least
>> when you use coarse-grained filtering methods it seems that this is
>> the case, and you might lose some interesting genes and in effect
>> throw out the baby with the bathwater, so to speak.
>>
>> Put (yet) another way: the chance, at GO node X, of getting n.GI[X]
>> interesting genes out of all the annotated genes n.GA[X] at that
>> node by sampling n.GI genes from n.GU at random tells you something
>> about the chance of enrichment at node X. I hope I got that part
>> right? But if n.GI and n.GA depend on n.GU, this chance of
>> enrichment might not change drastically when you reduce the gene
>> universe with some coarse-grained variance method? Or does it?
>
> I think you are on the right track. Filtering should change the
> results, otherwise, why would you filter? The question at hand is
> whether it is appropriate to include all genes annotated at a given GO
> term when testing that term. There is consensus (I hope) that genes
> that were not tested in the experiment should be removed.
> Non-specific filtering gives you a chance to remove additional genes
> that were tested, but appear to provide no information about the
> samples. My experience is that you get more conservative results by
> reducing the gene universe as much as possible. If you play with
> phyper a bit, I suspect you will come to a similar conclusion.
>
> + seth
>
> --
> Seth Falcon | Computational Biology | Fred Hutchinson Cancer
> Research Center
> http://bioconductor.org
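
To make the "play with phyper" suggestion above concrete, here is a
minimal sketch in base R. It is only an illustration under stated
assumptions: the helper over_rep_p() is hypothetical (it is not part of
GOstats), and all counts are made up for the example rather than taken
from the rat2302 analysis. It computes the usual one-sided
hypergeometric p-value for seeing at least n.GI[X] interesting genes at
node X, first while shrinking n.GU with everything else fixed, and then
for a case where filtering also shrinks n.GA[X], n.GI and n.GI[X].

```r
# Hypothetical helper: one-sided hypergeometric p-value for over-representation.
# n_GI_X: interesting genes annotated at node X
# n_GA_X: all universe genes annotated at node X
# n_GI:   number of interesting (selected) genes
# n_GU:   size of the gene universe
over_rep_p <- function(n_GI_X, n_GA_X, n_GI, n_GU) {
  # P(at least n_GI_X annotated genes among n_GI genes drawn at random,
  # without replacement, from n_GU genes of which n_GA_X are annotated at X)
  phyper(n_GI_X - 1, n_GA_X, n_GU - n_GA_X, n_GI, lower.tail = FALSE)
}

# Scenario 1 (made-up counts): only the universe shrinks; the counts at
# node X and the selected list stay the same.
universe_sizes <- c(10000, 8000, 6000, 4000, 2000)
p_fixed <- sapply(universe_sizes,
                  function(n_GU) over_rep_p(8, 100, 200, n_GU))
print(data.frame(n.GU = universe_sizes, p = signif(p_fixed, 3)))

# Scenario 2 (made-up counts): filtering also removes some annotated and
# some interesting genes, so n.GA[X], n.GI and n.GI[X] all drop.
p_full     <- over_rep_p(8, 100, 200, 10000)
p_filtered <- over_rep_p(6,  60, 150,  4000)
print(c(full = p_full, filtered = p_filtered))
```

In scenario 1 the p-value grows as the universe shrinks, which matches
the "more conservative" remark above. In scenario 2 the outcome depends
on how many annotated and interesting genes the filter happens to
remove, so it is worth trying counts that resemble your own data before
drawing conclusions.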