[Bioc-devel] ag and ath1121501

Wed Jun 14 18:40:52 CEST 2006

Hi,

I took a close look at the athPkgBuilder function in AnnBuilder (the 
function that generates ath1121501 and ag) and have some questions about 
the data sources of ath1121501 and ag:

1. probeset id to gene mapping:
The current mapping strategy is:
1) map probeset id to "Representative.Public.ID" using the Affymetrix 
GeneChip annotation data
2) use "Representative.Public.ID" as if it were an AGI locus id to get 
other annotations (pathway, GO, etc.) from TAIR

It seems that "Representative.Public.ID" is a mix of AGI locus ids, 
UniGene cluster ids, and a small number of ids from other sources. In the 
Affymetrix annotation file, there is another column called "Transcript ID 
(Array Design)", which has almost the same values as 
"Representative.Public.ID". I suspect it originated from 
ftp://ftp.tigr.org/pub/data/a_thaliana/Affymetrix/. I am not sure whether 
Affymetrix updates those two columns on a regular basis.
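To check how mixed the column really is, one could tally the id types with a small script. The following is only a sketch: the function names are made up for illustration, and it assumes AGI locus ids look like At1g01010 (chromosome 1-5, C or M) and that Arabidopsis UniGene cluster ids look like At.12345. Anything else is counted as "other".

```python
import re
from collections import Counter

# Assumed id shapes (not taken from the Affymetrix file itself):
#   AGI locus:       AT1G01010 / At1g01010 (chromosome 1-5, C or M)
#   UniGene cluster: At.12345
AGI_LOCUS = re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE)
UNIGENE = re.compile(r"^At\.\d+$")


def classify_id(identifier):
    """Return 'agi_locus', 'unigene' or 'other' for one id string."""
    identifier = identifier.strip()
    if AGI_LOCUS.match(identifier):
        return "agi_locus"
    if UNIGENE.match(identifier):
        return "unigene"
    return "other"


def summarize(ids):
    """Count how many ids fall into each category."""
    return Counter(classify_id(i) for i in ids)
```

Running summarize() over the "Representative.Public.ID" column and over the "Transcript ID (Array Design)" column would show how many probesets cannot be used directly as AGI locus ids.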

But if all the annotations (chromosome, go, pathway) come from TAIR, 
maybe we should use TAIR's mapping of probeset id to AGI locus id: 
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ :
    "The oligonucleotide sequences of the probes were mapped to the 
    Arabidopsis Transcripts dataset from the Arabidopsis genome TAIR6 
    version (released November 11, 2005). The dataset included 
    mitochondria and chloroplast genes, as well as pseudogenes and 
    non-coding RNAs. The mapping to the TAIR6 Transcripts was performed 
    using the BLASTN program with e-value cutoff < 9.9e-6. For the 
    25-mer oligo probes used on the Affy chips, the required match 
    length to achieve this e-value is 23 or more identical nucleotides. 
    To assign a probe set to a given locus, at least 9 of the probes 
    included in the probe set were required to match a transcript at 
    that locus."
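The assignment rule in the quoted README can be sketched roughly as below. This is not TAIR's actual pipeline: the function name and the input format are hypothetical, the hits are assumed to have already passed the BLASTN e-value cutoff, and the locus id is assumed to be the transcript id with its ".N" suffix removed (e.g. AT1G01010.1 -> AT1G01010).

```python
from collections import defaultdict

MIN_MATCHING_PROBES = 9  # threshold quoted in the TAIR README


def assign_loci(probe_hits, min_probes=MIN_MATCHING_PROBES):
    """Assign probe sets to loci.

    probe_hits: iterable of (probeset_id, probe_id, transcript_id)
    tuples, one per probe-to-transcript hit that passed the cutoff.
    Returns {probeset_id: sorted list of locus ids}.
    """
    # Collect the distinct probes of each probe set that hit each transcript.
    probes_per_transcript = defaultdict(set)
    for probeset_id, probe_id, transcript_id in probe_hits:
        probes_per_transcript[(probeset_id, transcript_id)].add(probe_id)

    # A locus is assigned if any one of its transcripts is matched
    # by at least min_probes distinct probes of the probe set.
    loci = defaultdict(set)
    for (probeset_id, transcript_id), probes in probes_per_transcript.items():
        if len(probes) >= min_probes:
            locus_id = transcript_id.split(".")[0]
            loci[probeset_id].add(locus_id)

    return {ps: sorted(found) for ps, found in loci.items()}
```

Probe sets that do not reach the threshold for any transcript are simply absent from the result, which corresponds to the probesets without an AGI locus id.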

Not all probeset ids have matching AGI locus ids. Do we need to provide 
mappings to other gene identifiers, such as GenBank accession numbers or 
Entrez Gene IDs, to make the annotations more complete? Affymetrix has 
started to provide probeset id to Entrez Gene ID mappings in its 
annotation files. Should we include that information? Also, I can see 
three possible ways to get the probeset-to-GenBank mapping: 1) from the 
Affymetrix annotation file directly, 2) probeset to AGI locus and then 
AGI locus to GenBank accession, all from TAIR, 3) probeset to Entrez 
Gene from Affymetrix, and then Entrez Gene to GenBank from NCBI. Which 
way is best? Or should we use the "voting" algorithm used by 
ABPkgBuilder?
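For discussion, a voting scheme over the three routes could look like the sketch below. This is a plain majority vote written for illustration, not the algorithm ABPkgBuilder actually uses, and the function names are hypothetical. Each source is a dict from probeset id to the id it proposes. A tie between sources yields no mapping.

```python
from collections import Counter


def vote(candidates):
    """Return the id proposed by the most sources, or None on a tie
    or when no source proposes anything."""
    votes = Counter(c for c in candidates if c)
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # the top two ids have equal support
    return ranked[0][0]


def consensus_mapping(*sources):
    """Combine several {probeset_id: target_id} dicts by majority vote."""
    probesets = set().union(*(s.keys() for s in sources))
    return {ps: vote([s.get(ps) for s in sources]) for ps in sorted(probesets)}
```

With only three sources, a probeset covered by exactly two disagreeing sources ends up unmapped, so a real implementation might want a source priority as a tie-breaker.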

2. chromosome location
The current package gets chromosome locations from 
ftp://ftp.arabidopsis.org/home/tair/Genes/est_mapping/est.Assignment.Locus
Even though the file seems to be updated very often, the directory it 
is located in and the README file are not, so it is not clear to me how 
the file is generated/updated. Any hints on that? Would 
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ be a better 
source? The meaning of chromosome location in those two sources may be 
different, though. The former means the location of a GenBank EST, and 
the latter means "chromosome coordinates of the best probe set match to 
the Transcripts dataset".

3. gene description (ath1121501GENENAME)
The current package (1.12.1) gets the description from 
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR_sequenced_genes . 
The descriptions are the same as those in 
ftp://ftp.arabidopsis.org/home/tair/Microarrays/Affymetrix/ . Both of 
them give the description of the AGI locus corresponding to an Affymetrix 
probeset. In the Affymetrix annotation file, there is a column called 
"Target Description". It is the description of the gene that a probeset 
is targeting. All probesets have descriptions there, so we would get 
better coverage than by getting descriptions from TAIR. When the 
"Representative Public ID" (or "Transcript ID") is an AGI locus id, it 
seems the description was retrieved from TAIR. However, it is not clear 
how this information is updated, or whether it is synchronized with 
TAIR's updates. Another possible source of descriptions is Entrez Gene, 
since Affymetrix maps probesets to Entrez Gene.
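To put numbers on the coverage difference between candidate description sources, something like the following sketch could be used. The function name is hypothetical, and each source is assumed to have been loaded into a dict from probeset id to description string.

```python
def coverage(probesets, descriptions):
    """Fraction of probesets that have a non-empty description.

    probesets: iterable of all probeset ids on the chip.
    descriptions: {probeset_id: description string or None}.
    """
    probesets = list(probesets)
    if not probesets:
        return 0.0
    annotated = sum(
        1 for ps in probesets if (descriptions.get(ps) or "").strip()
    )
    return annotated / len(probesets)
```

Calling coverage() once per source (TAIR, the Affymetrix "Target Description" column, Entrez Gene) on the same list of probeset ids would give directly comparable fractions.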

4. pathway
Pathway information is currently obtained from AraCyc, a pathway tool at 
TAIR: http://www.arabidopsis.org/tools/aracyc/introduction.jsp . I think 
it only contains metabolic pathways (I may be wrong, as I only read the 
introduction). KEGG contains regulatory pathways as well, and it is also 
manually curated. The two sources are independent of each other. 
Shall we include both of them?

5. pubmed
The probeset to pubmed mapping is currently obtained from 
ftp://ftp.arabidopsis.org/home/tair/Ontologies/Plant_Ontology/stru-060309.txt 
. The pubmed ids represent the publications that TAIR used to map an AGI 
locus id to a concept in Plant Ontology. But I think an environment like 
ath1121501PUBMED should represent the publications about the gene that 
matches a probeset. I didn't find an AGI locus to pubmed mapping in TAIR. 
So, we have to get it from either the Entrez Gene id or the GenBank 
accession. This gets back to the first question: what is the best way to 
map probesets to GenBank/Entrez Gene?

Hope this email is not too long. Any feedback will be highly 
appreciated. If we decide to use a better data source, I will be happy 
to do the implementation.

many thanks

Nianhua Li
computational biology, public health, FHCRC