[BioC] GOStats: can't find probesets, which are in overrepresented GO term

Fri Oct 1 15:43:14 CEST 2010

Hi Mike,

On 10/1/2010 8:07 AM, Mike Walter wrote:
> Dear BioC List,
>
> I have difficulties in understanding and retracing the results from GO overrepresentation analysis.
> I used GOTerms to see if there are any GO Terms overrepresented in a list of DE genes with following code:
>
>
> #define gene universe, select only probes with GO annotation and
> #transform probesetIDs in unique Entrez Ids
>>
>> input<- featureNames(d.norm.filt)
>> univ<- mget(input, mouse4302GO, ifnotfound=NA)
>> univ<- input[!is.na(univ)]
>> univ<- mget(univ, mouse4302ENTREZID, ifnotfound=NA)
>> univ<- unique(as.character(univ[!is.na(univ)]))
>
> #do the same for my list of genes (DE)
>>
>> ID<- mget(DE, mouse4302GO, ifnotfound=NA)
>> ID<- DE[!is.na(ID)]
>> ID<- mget(as.character(ID), mouse4302ENTREZID, ifnotfound=NA)
>> ID<- unique(as.character(ID[!is.na(ID)]))
>
> #run HyperGTest
>>
>> BP = new("GOHyperGParams",
> +    geneIds=ID, universeGeneIds=univ,
> +    annotation="mouse4302", ontology="BP",
> +    pvalueCutoff=0.01, conditional=TRUE,
> +    testDirection="over")
>> hGt.BP = hyperGTest(BP)
>> head(summary(hGt.BP))
>
>   GOBPID       Pvalue OddsRatio     ExpCount Count Size
> 1 GO:0006944 0.0001406440  146.7895 0.0180309101     2   21
> 2 GO:0014049 0.0008586148       Inf 0.0008586148     1    1
> 3 GO:0032303 0.0008586148       Inf 0.0008586148     1    1
> 4 GO:0032308 0.0008586148       Inf 0.0008586148     1    1
> 5 GO:0034405 0.0008586148       Inf 0.0008586148     1    1
> 6 GO:0043132 0.0008586148       Inf 0.0008586148     1    1
>   Term
> 1                       cellular membrane fusion
> 2     positive regulation of glutamate secretion
> 3              regulation of icosanoid secretion
> 4 positive regulation of prostaglandin secretion
> 5                 response to fluid shear stress
> 6                                  NAD transport
>
> When I now want to know, which probesets of my DE list are in the first GO, I tried following chunk.
>
>
>> go = as.matrix(unlist(mget("GO:0006944", mouse4302GO2PROBE, ifnotfound=NA)))
>> DEinGO = DE[is.element(DE,go[,1])]
>> DEinGO
> character(0)

The problem here is you are using the GO2PROBE mapping, whereas you 
should be using the GO2ALLPROBES mapping. The difference being that you 
are looking only at probesets that map to a particular node in the GO 
DAG, whereas the test you are using considers all probesets that map to 
that node and all descendant nodes (child, grandchild, etc).

This is because a child node by definition is also a member of its 
parent node. An example using a fake vocabulary:

Say you have a probeset that maps directly to 'MAP kinases'. This 
probeset also maps indirectly to an ancestor term, 'Protein 
Phosphorylation', as that is what a MAP kinase does. So you might have 5 
probesets that map to 'MAP kinases', and 3 probesets that map to 'MAP 
kinase kinases', but neither GO term is significant. However, 'Protein 
phosphorylation' might be (due to these 8 probesets), but there are no 
probesets in your significant set that map directly to protein 
phosphorylation.

Anyway, you are doing a lot of work here that isn't really necessary 
unless you really want a detailed understanding of what is going on. 
This isn't a bad idea IMO, and I applaud anybody who wants to know more 
about what they are doing than how to press the buttons, so if that's 
the case, then bravo!

If you just want to press buttons, however, the correct button to press 
would be probeSetSummary(), which will output all the probesets that are 
contributing to each significant GO term. Perusing the function will 
also help if you are trying to get a better understanding.

Best,

Jim

>
>
> However, this results in an empty vector. I then picked all probesets, which fall into this category,
> which results in only 5 probesets/2 genes.
>
>> revgo = mouse4302GO2PROBE
>> revgo = as.list(revgo[mappedkeys(revgo)])
>> revgo = lapply(revgo,as.vector)
>> probesInGO = as.vector(unlist(revgo["GO:0006944"))
>> mget(probesInGO, mouse4302ENTREZID)
> $`1417349_at`
> [1] "18457"
>
> $`1417350_at`
> [1] "18457"
>
> $`1457088_at`
> [1] "18457"
>
> $`1420833_at`
> [1] "22318"
>
> $`1420834_at`
> [1] "22318"
>
> None of the 5 probesets are in my DE list. So I have basically 2 questions:
>
> 1) Why is this GO overrepresented in my list? Or, where is the large error in my script, that results in this mistake?
>
> 2) Why is the size of the GO category in the summary 21, when there are only 5 probesets with this GO?
>
> Thank you very much for any ideas,
>
> Mike
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues