[BioC] annotation package for chicken affyprobes

Nianhua Li nli at fhcrc.org
Wed Aug 30 01:51:05 CEST 2006


Dear Lina,

Sorry for the late reply. As I mentioned in the previous email, the UCSC
genome annotation of chicken is obtained from
http://hgdownload.cse.ucsc.edu/goldenPath/galGal3/database/ , which
"contains a dump of the UCSC genome annotation database for the May 2006
assembly of the chicken genome (galGal3, Chicken Genome Sequencing
Consortium May 2006 release)".  We use two files to get chromosome
location information: refGene.txt.gz and refLink.txt.gz.

I downloaded the current versions of these two files (both dated Aug 27,
2006), imported them into SQLite, and got some statistics on the data:
(1) refGene has only 3847 records, which means only 3847 sequences have
chromosome location information.
(2) We draw annotations from refGene using the second column of the
file: the accession number of the RefSeq record representing the mRNA
sequence. There are only 3730 unique accession numbers in refGene.
(3) We use refLink to obtain the Entrez Gene to RefSeq mapping. refLink
covers 154286 unique Entrez Gene IDs and 174215 unique RefSeq accession
numbers (for mRNA). But after merging this information with refGene, we
only get chromosome location information for 3710 unique Entrez Gene
IDs. So 3710 is the maximum number of annotations one can get from UCSC
given a list of chicken Entrez Gene IDs.
(4) The affy2entrez mapping file you provided covers 13229 unique Entrez
Gene IDs; 3609 of them overlap with the 3710 Entrez Gene IDs we got in
(3), which is about 97% of what UCSC can annotate. So the mapping file
actually did a pretty good job.
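
For anyone who wants to repeat the counts in (1) to (4), here is a rough
sketch of the same kind of queries using Python's built-in sqlite3 module.
It is not the code used to build the annotation package. The rows are
invented toy data, the table layouts are reduced to a few columns, and the
column names (name, mrnaAcc, locusLinkId) and the affy table are
assumptions for illustration, so please check them against the real files
before relying on it.

```python
import sqlite3

# Toy rows standing in for the real files; all values are invented.
# refGene (assumed, reduced layout): bin, name (RefSeq mRNA accession),
# chrom, strand, txStart, txEnd
REFGENE_ROWS = [
    (585, "NM_000001", "chr1", "+", 100, 900),
    (585, "NM_000001", "chr1_random", "+", 50, 850),  # same accession twice
    (586, "NM_000002", "chr2", "-", 2000, 2600),
]
# refLink (assumed, reduced layout): mrnaAcc, locusLinkId (Entrez Gene ID)
REFLINK_ROWS = [
    ("NM_000001", 11),
    ("NM_000002", 22),
    ("NM_000003", 33),  # has no refGene row, so no chromosome location
]
# Hypothetical Entrez Gene IDs taken from an affy2entrez mapping file
AFFY_ENTREZ_IDS = [11, 33, 44]


def summarize(refgene_rows, reflink_rows, affy_ids):
    """Load the rows into an in-memory SQLite database and count overlaps."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE refGene (bin INTEGER, name TEXT, chrom TEXT, "
        "strand TEXT, txStart INTEGER, txEnd INTEGER)"
    )
    con.execute("CREATE TABLE refLink (mrnaAcc TEXT, locusLinkId INTEGER)")
    con.execute("CREATE TABLE affy (entrez_id INTEGER)")
    con.executemany("INSERT INTO refGene VALUES (?, ?, ?, ?, ?, ?)", refgene_rows)
    con.executemany("INSERT INTO refLink VALUES (?, ?)", reflink_rows)
    con.executemany("INSERT INTO affy VALUES (?)", [(i,) for i in affy_ids])

    def one(sql):
        # Return the single value produced by a counting query.
        return con.execute(sql).fetchone()[0]

    stats = {
        # (1) number of refGene records
        "refgene_records": one("SELECT COUNT(*) FROM refGene"),
        # (2) unique RefSeq accessions in refGene
        "refgene_accessions": one("SELECT COUNT(DISTINCT name) FROM refGene"),
        # (3) unique Entrez Gene IDs that end up with a chromosome location
        "located_entrez_ids": one(
            "SELECT COUNT(DISTINCT l.locusLinkId) FROM refLink l "
            "JOIN refGene g ON g.name = l.mrnaAcc"
        ),
        # (4) Entrez Gene IDs from the mapping file that are among those in (3)
        "affy_overlap": one(
            "SELECT COUNT(DISTINCT a.entrez_id) FROM affy a "
            "JOIN refLink l ON l.locusLinkId = a.entrez_id "
            "JOIN refGene g ON g.name = l.mrnaAcc"
        ),
    }
    con.close()
    return stats


if __name__ == "__main__":
    for key, value in summarize(REFGENE_ROWS, REFLINK_ROWS, AFFY_ENTREZ_IDS).items():
        print(key, value)
```

With the real data, the three lists would be read from refGene.txt.gz,
refLink.txt.gz and the mapping file instead of being typed in by hand.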

Overall, I think it is a "problem" with the UCSC genome database. What do
you think? If you have any suggestions for a better data source (e.g. a
different file from the UCSC FTP site), I can modify the code accordingly
if others agree.

nianhua


