[Bioc-devel] ag and ath1121501
Nianhua Li
nli at fhcrc.org
Wed Jun 14 18:40:52 CEST 2006
Hi,
I took a close look at the athPkgBuilder function in AnnBuilder (the
function that generates ath1121501 and ag) and have some questions about
the data sources of ath1121501 and ag:
1. probeset id to gene mapping:
The current mapping strategy was
1) map probe id to "Representative.Public.ID" by using Affymetrix
GeneChip annotation data
2) use "Representative.Public.ID" as if it were an AGI locus id to get
other annotations (pathway, GO, etc.) from TAIR
It seems that "Representative.Public.ID" is a mix of AGI locus ids,
UniGene clusters and a small number of ids from other sources. In the
Affymetrix annotation file, there is another column called "Transcript
ID (Array Design)", which has almost the same values as
"Representative.Public.ID". I believe it originated from
ftp://ftp.tigr.org/pub/data/a_thaliana/Affymetrix/. I am not sure
whether Affymetrix updates those two columns on a regular basis.
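To see how mixed the column really is, one could count how many values look like AGI locus ids. This is only a minimal sketch: it assumes the usual AGI pattern ("AT", a chromosome code of 1-5, C or M, "G", then five digits), and the example values are made up rather than taken from the annotation file.

```python
import re
from collections import Counter

# Assumed AGI locus pattern, e.g. AT1G01010 (case-insensitive).
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)

def classify_id(public_id):
    """Label one Representative Public ID value as 'agi_locus' or 'other'."""
    if AGI_LOCUS.match(public_id.strip()):
        return "agi_locus"
    return "other"

def summarize(ids):
    """Count how many ids fall into each label."""
    return Counter(classify_id(i) for i in ids)

# Hypothetical column values, for illustration only.
example_ids = ["AT1G01010", "At3g12345", "ATCG00020", "X12345", ""]
counts = summarize(example_ids)
```

Anything counted as "other" would need a different route to TAIR annotations.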
But if all the annotations (chromosome, go, pathway) come from TAIR,
maybe we should use TAIR's mapping of probeset id to AGI locus id:
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ :
"The oligonucleotide sequences of the probes were mapped to the
Arabidopsis Transcripts dataset from the Arabidopsis genome TAIR6
version (released November 11, 2005). The dataset included mitochondria
and chloroplast genes, as well as pseudogenes and non-coding RNAs. The
mapping to the TAIR6 Transcripts was performed using the BLASTN program
with e-value cutoff < 9.9e-6. For the 25-mer oligo probes used on the
Affy chips, the required match length to achieve this e-value is 23 or
more identical nucleotides. To assign a probe set to a given locus, at
least 9 of the probes included in the probe set were required to match a
transcript at that locus."
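The assignment rule quoted above can be sketched roughly as follows. This is not TAIR's actual code. It assumes the probe-level BLASTN hits have already been filtered by the e-value cutoff, and it simplifies by counting distinct probes per locus without distinguishing between transcripts at the same locus. The function name and input layout are hypothetical.

```python
from collections import defaultdict

MIN_MATCHING_PROBES = 9  # threshold quoted from the TAIR description

def assign_loci(probe_hits, min_probes=MIN_MATCHING_PROBES):
    """Map each probeset to the loci hit by at least min_probes of its probes.

    probe_hits: iterable of (probeset_id, probe_id, locus_id) tuples,
    one per hit that already passed the e-value cutoff.
    """
    probes_per_locus = defaultdict(set)
    for probeset_id, probe_id, locus_id in probe_hits:
        probes_per_locus[(probeset_id, locus_id)].add(probe_id)

    assigned = defaultdict(list)
    for (probeset_id, locus_id), probes in probes_per_locus.items():
        if len(probes) >= min_probes:
            assigned[probeset_id].append(locus_id)
    return {ps: sorted(loci) for ps, loci in assigned.items()}

# Hypothetical hits: 11 probes of ps1 match one locus, 5 match another.
hits = [("ps1", i, "AT1G01010") for i in range(11)]
hits += [("ps1", i, "AT1G01020") for i in range(5)]
hits += [("ps2", i, "AT2G01008") for i in range(8)]
assignment = assign_loci(hits)
```

Probesets with no locus reaching the threshold (ps2 here) are simply absent from the result, which matches the observation that not every probeset gets an AGI locus id.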
Not all probeset ids have matching AGI locus ids. Do we need to provide
mappings to other gene identifiers such as GenBank accession numbers or
Entrez Gene IDs to make the annotations more complete? Affymetrix has
started to provide probeset id to Entrez Gene ID mappings in their
annotation files. Should we include that information? Also, I can see
three possible ways to get a probe-to-GenBank mapping: 1) from the
Affymetrix annotation file directly, 2) probe to AGI locus and then AGI
locus to GenBank accession, all from TAIR, 3) probe to Entrez Gene from
Affy, and then Entrez Gene to GenBank from NCBI. Which way is best? Or
should we use the "voting" algorithm used by ABPkgBuilder?
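For discussion, a voting scheme over the three routes could look like the sketch below. This is a plain majority vote that I am assuming for illustration, not the actual ABPkgBuilder algorithm. The function name and the accession values are hypothetical.

```python
from collections import Counter

def vote(candidates):
    """Return the id proposed by the most routes, or None if there is
    no answer or the top two ids are tied."""
    counts = Counter(c for c in candidates if c)
    if not counts:
        return None
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between routes: leave unresolved
    return ranked[0][0]

# One candidate per route: Affymetrix file, TAIR chain, Entrez Gene chain.
agreed = vote(["ACC1", "ACC1", "ACC2"])
single = vote(["ACC1", None, None])
tied = vote(["ACC1", "ACC2", None])
```

Whether a tie should stay unresolved or fall back to a preferred source is a design decision the sketch leaves open.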
2. chromosome location
The current package gets chromosome locations from
ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus
Even though the file itself seems to be updated very often, the
directory it is located in and the README file are not, so it is not
clear to me how the file is generated or updated. Any hints on that?
Would ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ be a
better source? The meaning of chromosome location in those two sources
may be different though. The former means the location of a GenBank EST,
and the latter means "chromosome coordinates of the best probe set match
to the Transcripts dataset".
3. gene description (ath1121501GENENAME)
The current package (1.12.1) gets the description from
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes The
descriptions are the same as those in
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ Both give
the description of the AGI locus corresponding to an Affy probeset.
In the Affymetrix annotation file, there is a column called "Target
Description". It is the description of the gene that a probeset
targets. All probesets have descriptions there, so we would get better
coverage than by taking descriptions from TAIR. When the "Representative
Public ID" (or "Transcript ID") is an AGI locus id, it seems the
description was retrieved from TAIR. However, it is not clear how this
information is updated, or whether it is synchronized with TAIR's
updates. Another possible source of descriptions is Entrez Gene,
since Affymetrix maps probesets to Entrez Gene.
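One way to combine the two description sources while keeping track of where each value came from is sketched below. The preference order (TAIR first, Affymetrix "Target Description" as fallback) is only an assumption for illustration, and the function name, probeset ids and description strings are hypothetical.

```python
def pick_description(probeset_id, tair_desc, affy_desc):
    """Return (description, source) from the first source that has a
    non-empty description, or (None, None) if neither has one."""
    for source, table in (("TAIR", tair_desc), ("Affymetrix", affy_desc)):
        text = (table.get(probeset_id) or "").strip()
        if text:
            return text, source
    return None, None

# Hypothetical lookup tables keyed by probeset id.
tair_desc = {"ps_a": "expressed protein", "ps_b": ""}
affy_desc = {"ps_a": "hypothetical protein", "ps_b": "putative kinase"}

from_tair = pick_description("ps_a", tair_desc, affy_desc)
from_affy = pick_description("ps_b", tair_desc, affy_desc)
missing = pick_description("ps_c", tair_desc, affy_desc)
```

Recording the source next to each description would at least make it visible which entries depend on Affymetrix's update schedule.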
4. pathway
Pathway information is currently obtained from AraCyc, a pathway tool at
TAIR: http://www.arabidopsis.org/tools/aracyc/introduction.jsp . I think
it only contains metabolic pathways (I may be wrong, as I only read the
introduction). KEGG contains regulatory pathways as well, and it is also
manually curated. Those two sources are independent of each other.
Shall we include both of them?
5. pubmed
Probeset to pubmed mapping is currently obtained from
ftp://ftp.arabidopsis.org/home/tair/Ontologies/Plant_Ontology/stru-060309.txt
. The pubmed ids represent the publications that TAIR used to map an AGI
locus id to a concept in Plant Ontology. But I think an environment like
ath1121501PUBMED should represent the publications about the gene
matching a probeset. I didn't find an AGI locus to pubmed mapping in
TAIR, so we would have to get it through either Entrez Gene ids or
GenBank accessions. This gets back to the first question: what is the
best way to map probesets to GenBank/Entrez Gene?
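Whichever intermediate id is chosen, the pubmed mapping then becomes a chain of two one-to-many mappings. A minimal sketch of that chaining, with made-up ids and a hypothetical helper name:

```python
def compose(first, second):
    """Chain two one-to-many mappings: key -> intermediate ids -> targets.

    Keys that end up with no target are dropped from the result.
    """
    chained = {}
    for key, mids in first.items():
        targets = set()
        for mid in mids:
            targets.update(second.get(mid, ()))
        if targets:
            chained[key] = sorted(targets)
    return chained

# Hypothetical ids, for illustration only.
probeset_to_gene = {"ps_a": ["g1"], "ps_b": ["g2", "g3"], "ps_c": []}
gene_to_pubmed = {"g1": ["100", "200"], "g2": ["200"], "g3": ["300"]}
probeset_to_pubmed = compose(probeset_to_gene, gene_to_pubmed)
```

A probeset that maps to several genes collects the union of their publications, which may or may not be what ath1121501PUBMED should contain.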
Hope this email is not too long. Any feedback will be highly
appreciated. If we decide to use a better data source, I will be happy
to do the implementation.
many thanks
Nianhua Li
computational biology, public health, FHCRC