[BioC] GOstats, geneCounts and gene universe filtering...
Jesper Ryge
Jesper.Ryge at ki.se
Thu May 10 16:09:32 CEST 2007
Hi,
I'm trying to perform an enrichment analysis for GO terms on my
microarray results. My problem arose when I noticed that
geneCounts(x) doesn't match the number of genes annotated at certain
nodes according to geneIdsByCategory(x). Maybe that's fine, but I
wonder whether I missed something. I thought geneCounts was the
number of interesting genes (from the list passed to geneIds) that
belong to a particular GO term, and that geneIdsByCategory should
list those genes, i.e. the numbers should match. That turned out not
to be the case for at least two of the GO nodes in the list of
significantly over-represented GO terms:
> length(geneIdsByCategory(test)[["GO:0051179"]])
[1] 89
> geneCounts(test)["GO:0051179"]
GO:0051179
20
> length(geneIdsByCategory(test)[["GO:0007409"]])
[1] 13
> geneCounts(test)["GO:0007409"]
GO:0007409
6
test is the output from hyperGTest(params), a conditional test for
over-representation on the rat2302 chip.
As I said, I might have missed something, but it puzzles me somewhat.
Comments welcome :-)
As a "bonus" question, I was wondering whether there is any consensus
on filtering the gene universe before doing the GO enrichment
analysis. I know it's recommended in the GOstats manual, for instance
by removing probe sets with little variation across samples using the
IQR (or some similar measure). But in the topGO package by Adrian
Alexa they seem to care little about this issue and use all
GO-annotated probe sets from the chip used in the particular study.
If you reduce the set of genes in the gene universe (n.GU), don't you
also reduce the number of genes annotated to each GO term (n.GA) and
most likely the number of interesting genes (n.GI)? At least in my
case, some of the genes filtered out by IQR were classified as
significantly differentially expressed by cyberT or limma on the full
data set. So what I'm asking is: don't n.GI and n.GA depend on, and
change as a function of, n.GU? At least with coarse-grained filtering
methods this seems to be the case, and you might lose some
interesting genes and in effect throw the baby out with the
bathwater, so to speak.
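A toy illustration of the concern, in base R with made-up gene IDs.
The gene sets, the filter, and the count_terms helper are all
hypothetical; nothing here comes from my data or from the GOstats
API. It only shows that n.GA and n.GI shrink together with n.GU when
the universe is filtered:

```r
# Hypothetical toy universe of 20 genes.
universe    <- paste0("g", 1:20)
# Genes annotated at some GO node X (made up).
annotated.X <- c("g1", "g2", "g3", "g4", "g5", "g6")
# Genes called differentially expressed on the full data set (made up).
interesting <- c("g1", "g2", "g3", "g10", "g11")

# Pretend a variance filter removes these five genes from the universe.
filtered.universe <- setdiff(universe, c("g3", "g4", "g11", "g15", "g16"))

# Count the quantities from the text, restricted to a given universe.
count_terms <- function(universe, annotated, interesting) {
  c(n.GU   = length(universe),
    n.GA   = length(intersect(annotated, universe)),
    n.GI   = length(intersect(interesting, universe)),
    n.GI.X = length(Reduce(intersect, list(interesting, annotated, universe))))
}

unfiltered <- count_terms(universe, annotated.X, interesting)
filtered   <- count_terms(filtered.universe, annotated.X, interesting)
print(rbind(unfiltered, filtered))
```

In this toy case the filter drops g3 and g11, which were both in the
interesting list, so all four counts change at once.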
Put (yet) another way: the chance of getting n.GI[X] interesting
genes out of all the annotated genes n.GA[X] at GO node X, when
sampling n.GI genes from n.GU at random, tells you something about
the chance of enrichment at node X. I hope I got that part right? But
if n.GI and n.GA depend on n.GU, this chance of enrichment might not
change drastically when you reduce the gene universe with some
coarse-grained variance filter. Or would it?
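The sampling argument above can be sketched with the plain
hypergeometric tail in base R. This is only the unconditional
version, so it is not necessarily what hyperGTest computes in
conditional mode, and all the numbers are hypothetical toy values
rather than results from the rat2302 analysis:

```r
# Hypothetical toy numbers, not from the rat2302 analysis.
n.GU   <- 10000  # genes in the gene universe
n.GA.X <- 200    # universe genes annotated at GO node X
n.GI   <- 500    # interesting genes sampled from the universe
n.GI.X <- 20     # interesting genes annotated at node X

# Expected number of interesting genes at node X under random sampling.
expected <- n.GI * n.GA.X / n.GU

# P(at least n.GI.X annotated genes among n.GI random draws).
# phyper(q, m, n, k): m annotated genes, n non-annotated genes, k draws.
p.over <- phyper(n.GI.X - 1, n.GA.X, n.GU - n.GA.X, n.GI,
                 lower.tail = FALSE)

# The same tail probability, summed directly from the point masses.
p.check <- sum(dhyper(n.GI.X:min(n.GA.X, n.GI),
                      n.GA.X, n.GU - n.GA.X, n.GI))
```

Rerunning this with the counts from a filtered and an unfiltered
universe would show directly how much the tail probability at a node
moves when n.GU, n.GA[X] and n.GI all change together.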
My preliminary test of filtering versus no filtering seems to show
that the effect is rather small: most of the GO terms are identical
in both cases. Does that mean I should place more trust in the terms
that come up in both lists, i.e. with both the filtered and the
unfiltered gene universe? Or should I prefer one list over the other
for some particular reason? It seems to me that the GO terms that are
more robust to changes in the gene universe are the most likely
candidates.
Hm, I realise this became a little long. I hope I explained it in a
way that makes sense. Sorry if I'm raising an already discussed
issue, but I couldn't find any previous discussions on this. Advice
and pointers most appreciated :-)
cheers,
jesper ryge
PhD Student,
Department of Neuroscience
Karolinska Institutet