[BioC] GOStats: can't find probesets, which are in overrepresented GO term
James W. MacDonald
jmacdon at med.umich.edu
Fri Oct 1 15:43:14 CEST 2010
Hi Mike,
On 10/1/2010 8:07 AM, Mike Walter wrote:
> Dear BioC List,
>
> I have difficulties in understanding and retracing the results from GO overrepresentation analysis.
> I used GOTerms to see if there are any GO Terms overrepresented in a list of DE genes with following code:
>
>
> #define gene universe, select only probes with GO annotation and
> #transform probesetIDs in unique Entrez Ids
>>
>> input<- featureNames(d.norm.filt)
>> univ<- mget(input, mouse4302GO, ifnotfound=NA)
>> univ<- input[!is.na(univ)]
>> univ<- mget(univ, mouse4302ENTREZID, ifnotfound=NA)
>> univ<- unique(as.character(univ[!is.na(univ)]))
>
> #do the same for my list of genes (DE)
>>
>> ID<- mget(DE, mouse4302GO, ifnotfound=NA)
>> ID<- DE[!is.na(ID)]
>> ID<- mget(as.character(ID), mouse4302ENTREZID, ifnotfound=NA)
>> ID<- unique(as.character(ID[!is.na(ID)]))
>
> #run HyperGTest
>>
>> BP = new("GOHyperGParams",
> + geneIds=ID, universeGeneIds=univ,
> + annotation="mouse4302", ontology="BP",
> + pvalueCutoff=0.01, conditional=TRUE,
> + testDirection="over")
>> hGt.BP = hyperGTest(BP)
>> head(summary(hGt.BP))
>
> GOBPID Pvalue OddsRatio ExpCount Count Size
> 1 GO:0006944 0.0001406440 146.7895 0.0180309101 2 21
> 2 GO:0014049 0.0008586148 Inf 0.0008586148 1 1
> 3 GO:0032303 0.0008586148 Inf 0.0008586148 1 1
> 4 GO:0032308 0.0008586148 Inf 0.0008586148 1 1
> 5 GO:0034405 0.0008586148 Inf 0.0008586148 1 1
> 6 GO:0043132 0.0008586148 Inf 0.0008586148 1 1
> Term
> 1 cellular membrane fusion
> 2 positive regulation of glutamate secretion
> 3 regulation of icosanoid secretion
> 4 positive regulation of prostaglandin secretion
> 5 response to fluid shear stress
> 6 NAD transport
>
> When I now want to know, which probesets of my DE list are in the first GO, I tried following chunk.
>
>
>> go = as.matrix(unlist(mget("GO:0006944", mouse4302GO2PROBE, ifnotfound=NA)))
>> DEinGO = DE[is.element(DE,go[,1])]
>> DEinGO
> character(0)
The problem here is you are using the GO2PROBE mapping, whereas you
should be using the GO2ALLPROBES mapping. The difference being that you
are looking only at probesets that map to a particular node in the GO
DAG, whereas the test you are using considers all probesets that map to
that node and all descendant nodes (child, grandchild, etc).
This is because a child node by definition is also a member of its
parent node. An example using a fake vocabulary:
Say you have a probeset that maps directly to 'MAP kinases'. This
probeset also maps indirectly to an ancestor term, 'Protein
Phosphorylation', as that is what a MAP kinase does. So you might have 5
probesets that map to 'MAP kinases', and 3 probesets that map to 'MAP
kinase kinases', but neither GO term is significant. However, 'Protein
phosphorylation' might be (due to these 8 probesets), but there are no
probesets in your significant set that map directly to protein
phosphorylation.
Anyway, you are doing a lot of work here that isn't really necessary
unless you really want a detailed understanding of what is going on.
This isn't a bad idea IMO, and I applaud anybody who wants to know more
about what they are doing than how to press the buttons, so if that's
the case, then bravo!
If you just want to press buttons, however, the correct button to press
would be probeSetSummary(), which will output all the probesets that are
contributing to each significant GO term. Perusing the function will
also help if you are trying to get a better understanding.
Best,
Jim
>
>
> However, this results in an empty vector. I then picked all probesets, which fall into this category,
> which results in only 5 probesets/2 genes.
>
>> revgo = mouse4302GO2PROBE
>> revgo = as.list(revgo[mappedkeys(revgo)])
>> revgo = lapply(revgo,as.vector)
>> probesInGO = as.vector(unlist(revgo["GO:0006944"))
>> mget(probesInGO, mouse4302ENTREZID)
> $`1417349_at`
> [1] "18457"
>
> $`1417350_at`
> [1] "18457"
>
> $`1457088_at`
> [1] "18457"
>
> $`1420833_at`
> [1] "22318"
>
> $`1420834_at`
> [1] "22318"
>
> None of the 5 probesets are in my DE list. So I have basically 2 questions:
>
> 1) Why is this GO overrepresented in my list? Or, where is the large error in my script, that results in this mistake?
>
> 2) Why is the size of the GO category in the summary 21, when there are only 5 probesets with this GO?
>
> Thank you very much for any ideas,
>
> Mike
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
More information about the Bioconductor
mailing list