[BioC] GOstats - zebrafish
James W. MacDonald
jmacdon at med.umich.edu
Mon May 10 20:15:18 CEST 2010
Hi Neel,
Neel Aluru wrote:
> Hello BioC users,
>
> My question is pretty vague, so please bear with me. I want to do
> Gene set enrichment analysis (GSEA) on zebrafish agilent array data.
> I read the user guide and vignettes but still it is not quite clear
> to me how to proceed with it. What I have with regard to annotation
> of my probes is gene name, gene bank accession numbers and probe
> sequences. Using this information, is it possible to get GO
> annotations for my probes with any of the packages available.
> Zebrafish annotation package (org.Dr.eg.db) has objects such as
> org.Dr.egGO2EG, org.Dr.egGO etc. Is it possible to map with them and
> then you those data in GSEA? Most of the examples are on affymetrix
> data and I cannot seem to find literature where it is used on agilent
> arrays.
It's pretty simple, actually. You just need two things; a set of Entrez
Gene IDs that represent the set of genes that you are calling
differentially expressed, and a set of Entrez Gene IDs that represents
the unique set of genes that the chip interrogates.
Here I am assuming that you want to do a Fisher's exact test (which I
guess is technically a GSEA, but not commonly called that).
Note that the org.Dr.eg.db package is based on Entrez Gene IDs (e.g.,
the canonical mapping is from EG --> whatever). Since you have accession
numbers, we want to use the org.Dr.egACCNUM table. In addition, we need
to reverse the mappings to be EG <-- ACCNUM. Say your accession numbers
are in a vector called 'accnum'.
egs <- mget(accnum, revmap(org.Dr.egACCNUM), ifnotfound = NA)
Two things here. The revmap() function just switches things so we find
Entrez Gene IDs using accession numbers as input. We also say to give an
NA if the Entrez Gene ID isn't found.
This should be a many-to-one mapping, if I understand GenBank and
Entrez, so e.g., a given accession number should just point to one
Entrez Gene ID. However, it's nice to check.
all(sapply(egs, length) == 1)
Also, how many NAs are there? Assuming all lengths == 1, you can do
sum(sapply(egs, is.na))
otherwise you need
sum(sapply(egs, function(x) all(is.na(x))))
If there are any duplicate Entrez Gene IDs there, you will have to
decide which you want to use. If they are all length one, then
egs <- unique(unlist(egs))
We unique-ify this vector because a given gene can only be
differentially regulated once in a given sample.
if there are NAs, then
egs <- egs[!is.na(egs)]
Now you need a vector of Entrez Gene IDs that represents all genes on
the chip. I will assume you can get a vector of accession numbers for
this as well. I further assume it is a character vector called 'univ'.
We do the same rigamarole:
univ <- mget(univ, revmap(org.Dr.egACCNUM), ifnotfound = NA)
check that each list member is length one, if not, choose which EG ID
you like, get rid of NAs, and make unique. Then you proceed just like in
the vignette:
p <- new("GOHyperGParams", geneIds = egs, universeGeneIds = univ,
annotation = "org.Dr.eg.db", conditional = TRUE) #maybe other args
hypt <- hyperGtest(p)
summary(hypt)
Best,
Jim
>
> Any suggestions or comments will be appreciated.
>
> Thank you,
>
> Neel
>
> Neel Aluru Postdoctoral Scholar Biology Department Woods Hole
> Oceanographic Institution Woods Hole, MA 02543 USA 508-289-3607
>
> _______________________________________________ Bioconductor mailing
> list Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
> archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
More information about the Bioconductor
mailing list