[BioC] Help with annotation packages

James W. MacDonald jmacdon at med.umich.edu
Tue Mar 28 19:18:38 CEST 2006


Hi Amy,

Amy Mikhail wrote:
> Dear list,
> 
> I have a mosquito microarray that I would like to annotate, but am having
> some trouble figuring out which packages are appropriate to use.  After
> reading the AnnBuilder, annotate and biomaRt vignettes, I am still really
> unsure if any of those packages would do what I want.  So here is my
> question:
> 
> The array is for Anopheles gambiae, and consists of about 13,500 cDNA
> spots from PCR plates - probe sequences between 150 and 500 bp in length. 
> The manufacturer of my array provided a .GAL file with it - this was made
> in GenePix and lists Ensembl gene transcripts under the column "name" and
> Ensembl gene identifiers under the column "ID".
> 
> What I would really like is to add an extra column to this .GAL file (or
> actually my .gpr outputs from GenePix) which would contain gene
> function/ontology information, so that everything I do with my results
> thereafter would come up with the GO information as well (e.g. topTable
> from limma).

I don't know if this is going to work out too well. There is a 
one-to-many relationship between e.g., Ensembl ID and GO terms 
(especially if you are using all three GO types), so it is not likely 
that you will be able to get all your GO terms to fit into a topTable.
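
To make the one-to-many problem concrete, here is a small sketch in base R. 
The IDs, GO labels and column names are invented placeholders (not real 
annotation), and 'tab' only stands in for something like topTable() output:

```r
# Toy illustration only: the IDs and GO labels below are invented placeholders.
tab <- data.frame(ID = c("geneA", "geneB", "geneC"),
                  logFC = c(2.1, -1.4, 0.9),
                  stringsAsFactors = FALSE)

# One row per gene/GO-term pair; geneA has three terms, geneC has none
go <- data.frame(ID = c("geneA", "geneA", "geneA", "geneB"),
                 go_id = c("GO:x1", "GO:x2", "GO:x3", "GO:x1"),
                 stringsAsFactors = FALSE)

# all.x = TRUE keeps genes without any GO term (they get NA)
merged <- merge(tab, go, by = "ID", all.x = TRUE)

nrow(tab)         # one row per gene
nrow(merged)      # one row per gene/GO-term pair
table(merged$ID)  # how many rows each gene now occupies
```

The merged table no longer has one row per gene, which is why a flat 
results table is an awkward home for GO terms.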

However, you could use topTable() to get the most interesting genes, 
then extract the Ensembl IDs associated with them and use biomaRt to get 
the GO terms using the getGO() function. If you then want to look at 
your data along with the GO terms, you could use htmlpage() in the 
annotate package to make an HTML table where you *could* visualize 
things with a one-to-many relationship.

This will take some work on your part to figure things out, but the main 
workflow would be something like:

Run topTable() and pull the vector of Ensembl IDs out of the result.
Load biomaRt and set up a connection using useMart(). To figure out the 
dataset to use, I usually do:

mart <- useMart("ensembl")
listDatasets(mart)
mart <- useMart("ensembl", dataset = "<name of dataset from above>")

Then use getGO() to get a data.frame containing all the GO terms. You 
can then use the 'ensembl_gene_id' column to parse out which GO terms 
belong to which ID. You want to make a list of the same length as your 
vector of Ensembl IDs and then stick the unique GO terms for each 
Ensembl ID into the corresponding position of the list.

For example, say you have a vector of Ensembl IDs. Then you would do 
something like this (not tested):

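ens.ids <- <character vector of IDs>
# Query biomaRt for the GO terms of all IDs in one call
go <- getGO(ens.ids, "ensembl", mart=mart)
# One list element per Ensembl ID, in the same order as ens.ids
mylist <- vector("list", length(ens.ids))
# Column 4 is taken to hold the Ensembl gene ID and column 2 the GO terms
for(i in seq(along = ens.ids))
	mylist[[i]] <- unique(subset(go, go[,4] == ens.ids[i], select = 2))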
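
The same grouping step can also be sketched without a loop, using split(). 
This is only an illustration that runs offline: the 'go' data.frame below is 
a mock stand-in for the query result, the column names 'go_id' and 
'ensembl_gene_id' are assumptions, and all values are invented.

```r
# Mock stand-in for the GO query result; column names are assumed and the
# values are invented placeholders.
go <- data.frame(go_id = c("GO:x1", "GO:x2", "GO:x1", "GO:x1", "GO:x3"),
                 ensembl_gene_id = c("geneA", "geneA", "geneA", "geneB", "geneD"),
                 stringsAsFactors = FALSE)

ens.ids <- c("geneB", "geneA", "geneC")

# Split the GO terms by gene. Using a factor with levels = ens.ids keeps the
# list in the same order as ens.ids, gives an empty element for IDs with no
# GO terms, and drops rows for IDs that were not asked for.
mylist <- split(go$go_id, factor(go$ensembl_gene_id, levels = ens.ids))

# Remove duplicated GO terms within each gene
mylist <- lapply(mylist, unique)

str(mylist)
```

Unlike the loop above, each element here is a plain character vector rather 
than a one-column data.frame, and the list carries the Ensembl IDs as names.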

You can then use htmlpage() in the annotate package to turn this into an 
HTML table. You won't be able to make links to databases currently 
because there isn't a function to make links to Ensembl. However, you 
could put the mylist from above along with vectors of Ensembl IDs, 
p-values, t-statistics, and a data.frame of your expression values into 
another list and use that for the 'othernames' argument to htmlpage().

As I mentioned above, this will take some work on your part since I have 
only sketched the basic idea here. However, this is likely the easiest 
way to go, as compared to building an annotation package.

HTH,

Jim


> 
> I know that the latest An. gambiae annotation available in Ensembl is
> agam_P3, and would like to use this but have to bear in mind that the
> microarray probe IDs were provided from an earlier build, so a number of
> genes on the array will not be present in the agam_P3 list.  If the
> package I use flags these as NAs or whatever, that would be fine for the
> moment.
> 
> My confusion is really over which package to use:
> 
> I understand that biomaRt can handle single queries or queries for a small
> list of (e.g.) DE genes, but not the entire probe set.  Is that right? 
> Also, I note that other list users with queries relating to biomaRt have
> been directed to use the devel version.  I think this doesn't work with R
> 2.2.1?
> 
> It also seems that the annotate package is only suitable for species that
> Bioconductor has specifically created libraries for, and that there are
> currently only libraries for human, mouse and rat ... so not suitable for
> me either?
> 
> Lastly, the AnnBuilder package sounds most like what I'm after, but I'm a
> bit confused about whether it is limited in the public data repositories
> it can use, as the probe IDs I have are from Ensembl, not Entrez Gene. 
> Also I gather I would have to query the data package that AnnBuilder
> creates every time I want the annotation info for a given list of genes,
> rather than it being linked to my .gpr or .GAL files.  Have I understood
> that correctly, and if so is there any way to link annotation info to the
> .GAL file itself? Also is Perl something one has to download in order to
> use this package (please excuse the very naive question as I'm not a
> bioinformatician)?
> 
> So just to recap; all I actually want to do is merge the AGAM P3
> annotation list with my .GAL file, and make sure that the new columns
> appear as part of the output from limma, etc.
> 
> Looking forward to any advice / suggestions,
> 
> Regards,
> Amy
> 
> R: 2.2.1, Bioconductor: 1.7, OS: windows XP.
> 
> -------------------------------------------
> Amy Mikhail
> Research student
> University of Aberdeen
> Zoology Building
> Tillydrone Avenue
> Aberdeen AB24 2TZ
> Scotland
> Email: a.mikhail at abdn.ac.uk
> Phone: 00-44-1224-272880 (lab)
> 


-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623




