[BioC] annotation package for chicken affyprobes

Wed Aug 30 15:30:06 CEST 2006

Dear Nianhua,

Our communication gets a bit delayed by the time difference :-). I took a
look at some of the other files on the UCSC FTP site, and it seems that the
files all_mrna.txt.gz, mrnaOrientInfo.txt.gz, all_est.txt.gz and
estOrientInfo.txt.gz provide better chromosome location information. They
all contain chromosome information for GenBank IDs. I matched them against
the affy2genbank file I had (25654 probeset IDs mapped to GenBank IDs); the
resulting matches are shown below.

File            Total gb ids    Matching gb ids
mrnaOrientInfo  30946           10993
all_mrna        27098           10392
estOrientInfo   616962          11899
all_est         616978          11899
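
For anyone who wants to repeat this kind of count, here is a minimal
Python sketch of the matching step. It is not the script I actually used.
The function names are made up for illustration, and the column that holds
the GenBank accession (acc_col) is an assumption that has to be checked
against the .sql schema file UCSC ships next to each table dump.

```python
def accessions_from_lines(lines, acc_col):
    """Collect the unique accession IDs found in one tab-separated column.

    acc_col is the 0-based column index of the accession; which column
    that is depends on the table and is an assumption to verify.
    """
    accessions = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > acc_col and fields[acc_col]:
            accessions.add(fields[acc_col])
    return accessions


def count_matches(ucsc_accessions, probeset_to_genbank):
    """Return (total unique UCSC IDs, how many also occur in the probeset mapping)."""
    mapped = set(probeset_to_genbank.values())
    return len(ucsc_accessions), len(ucsc_accessions & mapped)


if __name__ == "__main__":
    # Toy data standing in for a UCSC dump and the affy2genbank mapping.
    toy_dump = ["585\tchr1\t100\t200\tAB000001\n", "585\tchr2\t5\t50\tAB000002\n"]
    toy_mapping = {"probeset_1": "AB000001", "probeset_2": "AB999999"}
    total, matching = count_matches(accessions_from_lines(toy_dump, 4), toy_mapping)
    print(total, matching)
```

With a real file, the lines could come from gzip.open(path, "rt") in place
of the toy list.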

I guess it might be possible to increase the number of mappings even
further. Still, the coverage isn't very good, so I think I will try to
BLAST the probeset sequences against the genome to get chromosome
locations. I might wait until the galGal3 version is available at Ensembl
before doing that, though.

/Lina

-----Original Message-----
From: Nianhua Li [mailto:nli at fhcrc.org]
Sent: 30 August 2006 01:51
To: Lina Hultin-Rosenberg
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: SV: [BioC] annotation package for chicken affyprobes

Dear Lina,

Sorry for the late reply. As I mentioned in the previous email, the UCSC
genome annotation of chicken is obtained from
http://hgdownload.cse.ucsc.edu/goldenPath/galGal3/database/ , which
"contains a dump of the UCSC genome annotation database for the May 2006
assembly of the chicken genome (galGal3, Chicken Genome Sequencing
Consortium May 2006 release)".  We use two files to get chromosome
location information: refGene.txt.gz and refLink.txt.gz.

I downloaded the current versions of these two files (both dated Aug 27,
2006), imported them into SQLite and got some statistics on the data:
(1) refGene has only 3847 records, which means only 3847 sequences have
chromosome location information.
(2) We draw annotations from refGene using the second column in the
file: the accession number of the RefSeq record representing the mRNA
sequence. There are only 3730 unique accession numbers in refGene.
(3) We use refLink to obtain the Entrez Gene to RefSeq mapping. refLink
covers 154286 unique Entrez Gene IDs and 174215 unique RefSeq accession
numbers (for mRNA). But after merging this information with refGene, we
only get chromosome location information for 3710 unique Entrez Gene
IDs. So, 3710 is the maximum number of annotations one can get from UCSC
given a list of chicken Entrez Gene IDs.
(4) The affy2entrez mapping file you provided covers 13229 unique Entrez
Gene IDs, and 3609 of them overlap with the 3710 Entrez Gene IDs we got in
(3). So, the mapping file actually did a pretty good job.
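
The counts in (1) to (4) can be reproduced with a few queries. Below is a
minimal Python/sqlite3 sketch of the idea, not the code used to build the
annotation package. The function name is hypothetical, and the tables are
cut down to the columns needed here; the column names (name, chrom,
txStart, mrnaAcc, locusLinkId) are assumptions based on the UCSC table
layout and should be checked against the refGene.sql and refLink.sql
schema files.

```python
import sqlite3


def summarize_locations(ref_gene_rows, ref_link_rows, array_gene_ids):
    """Count refGene records, unique RefSeq accessions, Entrez Gene IDs that
    gain a chromosome location after the merge, and how many of those are
    in the array's Entrez Gene ID set."""
    conn = sqlite3.connect(":memory:")
    # Simplified, assumed schemas: only the columns this summary needs.
    conn.execute("CREATE TABLE refGene (name TEXT, chrom TEXT, txStart INTEGER)")
    conn.execute("CREATE TABLE refLink (mrnaAcc TEXT, locusLinkId INTEGER)")
    conn.executemany("INSERT INTO refGene VALUES (?, ?, ?)", ref_gene_rows)
    conn.executemany("INSERT INTO refLink VALUES (?, ?)", ref_link_rows)

    records = conn.execute("SELECT COUNT(*) FROM refGene").fetchone()[0]
    accessions = conn.execute(
        "SELECT COUNT(DISTINCT name) FROM refGene").fetchone()[0]
    # Merge step: keep Entrez Gene IDs whose RefSeq accession is in refGene.
    located = {row[0] for row in conn.execute(
        "SELECT DISTINCT l.locusLinkId "
        "FROM refGene AS g JOIN refLink AS l ON l.mrnaAcc = g.name")}
    conn.close()
    return {
        "records": records,
        "unique_accessions": accessions,
        "located_genes": len(located),
        "located_on_array": len(located & set(array_gene_ids)),
    }


if __name__ == "__main__":
    # Toy rows only; real input would be the imported UCSC dumps.
    ref_gene = [("NM_1", "chr1", 100), ("NM_1", "chr2", 200), ("NM_2", "chr1", 300)]
    ref_link = [("NM_1", 11), ("NM_2", 22), ("NM_4", 44)]
    print(summarize_locations(ref_gene, ref_link, {11, 33, 44}))
```

The last value corresponds to the overlap reported in (4), so a low number
there points at the data source rather than at the mapping file.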

Overall, I think it is a "problem" with the UCSC genome database. What do
you think? If you have any suggestions for a better data source
(e.g. a different file from the UCSC FTP site), I can modify the code
accordingly if others agree.

nianhua