[BioC] GOstats - zebrafish

Mon May 10 20:15:18 CEST 2010

Hi Neel,

Neel Aluru wrote:
> Hello BioC users,
> 
> My question is pretty vague, so please bear with me. I want to do
> Gene set enrichment analysis (GSEA) on zebrafish agilent array data.
> I read the user guide and vignettes but still it is not quite clear
> to me how to proceed with it. What I have with regard to annotation
> of my probes is gene name, gene bank accession numbers and probe
> sequences. Using this information, is it possible to get GO
> annotations for my probes with any of the packages available.
> Zebrafish annotation package (org.Dr.eg.db) has objects such as
> org.Dr.egGO2EG, org.Dr.egGO etc. Is it possible to map with them and
> then you those data in GSEA? Most of the examples are on affymetrix
> data and I cannot seem to find literature where it is used on agilent
> arrays.

It's pretty simple, actually. You just need two things; a set of Entrez 
Gene IDs that represent the set of genes that you are calling 
differentially expressed, and a set of Entrez Gene IDs that represents 
the unique set of genes that the chip interrogates.

Here I am assuming that you want to do a Fisher's exact test (which I 
guess is technically a GSEA, but not commonly called that).

Note that the org.Dr.eg.db package is based on Entrez Gene IDs (e.g., 
the canonical mapping is from EG --> whatever). Since you have accession 
numbers, we want to use the org.Dr.egACCNUM table. In addition, we need 
to reverse the mappings to be EG <-- ACCNUM. Say your accession numbers 
are in a vector called 'accnum'.

egs <- mget(accnum, revmap(org.Dr.egACCNUM), ifnotfound = NA)

Two things here. The revmap() function just switches things so we find 
Entrez Gene IDs using accession numbers as input. We also say to give an 
NA if the Entrez Gene ID isn't found.

This should be a many-to-one mapping, if I understand GenBank and 
Entrez, so e.g., a given accession number should just point to one 
Entrez Gene ID. However, it's nice to check.

all(sapply(egs, length) == 1)

Also, how many NAs are there? Assuming all lengths == 1, you can do

sum(sapply(egs, is.na))

otherwise you need

sum(sapply(egs, function(x) all(is.na(x))))

If there are any duplicate Entrez Gene IDs there, you will have to 
decide which you want to use. If they are all length one, then

egs <- unique(unlist(egs))

We unique-ify this vector because a given gene can only be 
differentially regulated once in a given sample.

if there are NAs, then

egs <- egs[!is.na(egs)]

Now you need a vector of Entrez Gene IDs that represents all genes on 
the chip. I will assume you can get a vector of accession numbers for 
this as well. I further assume it is a character vector called 'univ'.

We do the same rigamarole:

univ <- mget(univ, revmap(org.Dr.egACCNUM), ifnotfound = NA)

check that each list member is length one, if not, choose which EG ID 
you like, get rid of NAs, and make unique. Then you proceed just like in 
the vignette:

p <- new("GOHyperGParams", geneIds = egs, universeGeneIds = univ,
annotation = "org.Dr.eg.db", conditional = TRUE) #maybe other args

hypt <- hyperGtest(p)

summary(hypt)

Best,

Jim

> 
> Any suggestions or comments will be appreciated.
> 
> Thank you,
> 
> Neel
> 
> Neel Aluru Postdoctoral Scholar Biology Department Woods Hole
> Oceanographic Institution Woods Hole, MA 02543 USA 508-289-3607
> 
> _______________________________________________ Bioconductor mailing
> list Bioconductor at stat.math.ethz.ch 
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues