[BioC] Affymetrix Array Rat Gene 1.0 ST - some annotation challenges

Wed Dec 3 03:45:25 CET 2008

I apologize if this is not the right forum to post this question - I am asking about annotation for a particular kind of Affymetrix array. I have all the files from their site. The problem is not too little information, it is too much information.

The array platform is the Affymetrix "Rat Gene 1.0 ST" array. I am sure this is similar to the Mouse Gene ST and related arrays, which have been discussed on this list.

Using instructions from James McDonald posted on this "gmane" website,
http://article.gmane.org/gmane.science.biology.informatics.conductor/18963/match=oligo+s
I was able to process the CEL files for this array. The instructions worked just fine, and I was able to do all the analysis in oligo that I am usually able to do using affy, including quality plots, normalization, differential expression, etc.

I am finding the annotation to be a bit of a challenge though. After running rma in oligo, the ExpressionSet object has the probe IDs along with the expression values. There is also a file, "RaGene-1_0-st-v1.na26.rn4.transcript.csv", on the Affymetrix web site that can be downloaded along with the appropriate readme files. This csv file maps the probe IDs to "gene_assignment", "mrna_assignment", etc. So far so good ...

The challenge is that both the "gene_assignment" and the "mrna_assignment" columns often list multiple genes and multiple mRNAs or Ensembl IDs. The readme file ("RaGene-1_0-st-v1.na26.AFFX_README.NetAffx-CSV-Files.txt"), also from the Affymetrix website, describes what is in these columns. For example, the columns contain "assignment scores" and "coverage scores" "between a public mRNA and a transcript cluster". The higher the scores, the better the probe (or transcript cluster) matches the mRNA, or vice versa.
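To make the structure of these multi-valued cells concrete, here is a minimal Python sketch of how one cell might be split apart. It assumes that separate assignments are delimited by " /// ", that fields within an assignment are delimited by " // ", and that "---" marks an empty cell; these conventions should be checked against the readme file. The function name and the example cell are hypothetical and made up purely for illustration.

```python
def split_assignments(cell):
    """Split one annotation cell into a list of assignments.

    Assumed (verify against the readme): assignments are separated
    by ' /// ', fields inside an assignment by ' // ', and '---'
    marks a cell with no annotation.
    """
    if cell is None or cell.strip() in ("", "---"):
        return []
    return [
        [field.strip() for field in entry.split(" // ")]
        for entry in cell.split(" /// ")
    ]


# Hypothetical cell with two gene assignments (values are invented).
example = (
    "ACC1 // GeneA // gene A title // 1q11 // 1001"
    " /// "
    "ACC2 // GeneB // gene B title // 1q12 // 1002"
)

for fields in split_assignments(example):
    print(fields)
```

Each inner list is one assignment, so a cell with several genes becomes several short records that can be filtered or ranked.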

My challenge is: how do I condense this annotation efficiently for the principal investigator? I was thinking of just taking the first transcript assignment from the "gene_assignment" and "mrna_assignment" columns, but I am not sure this is the right thing to do. I suppose I could somehow take the assignments with the highest scores, but I think someone may know a better and faster way.
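The two options just mentioned (first assignment versus highest-scoring assignment) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the " /// " and " // " delimiters, and the positions of the assignment score and coverage score within each mRNA assignment (taken here as the fifth and sixth fields), need to be confirmed against the readme file. The function names and the example cell are hypothetical.

```python
def _split(cell):
    # Assumed delimiters: ' /// ' between assignments, ' // ' between fields.
    if cell is None or cell.strip() in ("", "---"):
        return []
    return [[f.strip() for f in e.split(" // ")] for e in cell.split(" /// ")]


def first_assignment(cell):
    """Option 1: keep only the first listed assignment (or None)."""
    entries = _split(cell)
    return entries[0] if entries else None


def best_mrna_assignment(cell, score_idx=4, coverage_idx=5):
    """Option 2: keep the assignment with the highest (score, coverage).

    score_idx and coverage_idx are ASSUMED 0-based field positions;
    check them against the readme before relying on the result.
    Entries whose scores are missing or non-numeric are skipped.
    """
    best_key, best_fields = None, None
    for fields in _split(cell):
        try:
            key = (float(fields[score_idx]), float(fields[coverage_idx]))
        except (IndexError, ValueError):
            continue
        if best_key is None or key > best_key:
            best_key, best_fields = key, fields
    return best_fields


# Hypothetical mrna_assignment cell (all values invented).
cell = (
    "ACC1 // SourceA // desc A // chr1 // 80 // 100 // 20 // 25 // 0"
    " /// "
    "ACC2 // SourceB // desc B // chr1 // 100 // 100 // 25 // 25 // 0"
)

print(first_assignment(cell)[0])
print(best_mrna_assignment(cell)[0])
```

In this made-up cell the two options disagree, which is exactly the situation where the choice of rule matters. Ties are resolved in favor of the assignment listed first.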

I did try using the "annaffy" package (for example, the function "aafGenBank"), but the ragene10st.db package cannot be found on the Bioconductor website (I do see the mogene10st.db package there, though). I was also going to try exonmap (even though this is not an exon array), but have had trouble loading the package so far.

Has anyone run into this annotation problem for these types of arrays? Any suggestions on how to come up with reasonable annotation for each probe ID?

I am using R 2.8.0 and the latest release of Bioconductor (2.3) on a
Windows XP  machine.

Thanks, 

Alex Cambon
Biostatistician
Department of Bioinformatics and Biostatistics
School of Public Health and Information Sciences
University of Louisville, Louisville, KY 40292

502-852-4111