[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Wed Mar 6 19:26:37 CET 2013

Dear Kamila,

Here is some history, as far as I remember:

Originally (1999) Affymetrix sold the Hu6800 (HuGeneFL) array, which was 
an ivt array with most probes (oligos) located on the 3'-end. At that 
time the collection of probes which represented one gene was called a 
probeset, and there was (and still is) only one annotation file for the 
ivt arrays. In addition there was one library file, the Hu6800.CDF file. 
Most ivt arrays still have CDF-files which map the probes to the (x,y) 
location on the arrays.

With the introduction of the HuExon 1.0 ST array as the first exon array, 
Affymetrix changed a couple of things:

- they replaced one CDF-file with two files, called CLF-file and 
PGF-file, respectively.

- in addition they now provide two annotation files, a 
transcript-cluster annotation file and a probeset annotation file. A 
given gene (transcript-cluster) typically consists of one or more 
exons, one exon consists of one or more probesets, and one probeset 
usually consists of 2-4 probes (oligos).

The probeset annotation file now lists all probesets with their 
"probeset_id", as well as the "transcript_cluster_id" and the "exon_id" 
for each probeset. In contrast, the transcript annotation file lists 
only the genes with their "transcript_cluster_id".

At some later time Affymetrix introduced a cheaper 'exon' array, the 
'Whole Genome' HuGene 1.0 ST array, since many labs were mainly 
interested in the expression of the genes and not of the individual 
exons. This cheaper array typically has only one probe per exon. 
Originally Affymetrix sold this array as an array to measure gene 
expression, and there was only one annotation file. Later on they decided 
to convert the HuGene array to an exon array, too. Thus all WT 
arrays now come with both a probeset annotation file and a transcript 
annotation file. In principle you could now use WT arrays to measure the 
expression of single exons, but with the disadvantage that there is 
usually only one probe per exon.

To understand the distinction between probeset and transcript annotation 
files please look at the annotation files, and especially read the 
README files which Affymetrix usually provides in the annotation zip-files.
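
As a rough illustration of how the two ID levels relate, here is a 
minimal R sketch. It does not read a real Affymetrix file: the CSV text, 
the header line, and all IDs below are made up for the example, and only 
the column names "probeset_id", "transcript_cluster_id" and "exon_id" are 
taken from the description above. The comment.char setting assumes that 
the header lines of the annotation file start with '#'.

```r
# Toy stand-in for a probeset annotation file; header line and IDs are hypothetical.
toy_csv <- '#%example-header=hypothetical
"probeset_id","transcript_cluster_id","exon_id"
"1001","5001","9001"
"1002","5001","9002"
"1003","5001","9002"
"1004","5002","9003"'

# Skip the '#' header lines and keep all IDs as character strings.
probesets <- read.csv(text = toy_csv, comment.char = "#",
                      colClasses = "character")

# Number of probesets that belong to each transcript cluster.
probesets_per_cluster <- table(probesets$transcript_cluster_id)
print(probesets_per_cluster)

# Number of distinct exons per transcript cluster.
exons_per_cluster <- tapply(probesets$exon_id,
                            probesets$transcript_cluster_id,
                            function(x) length(unique(x)))
print(exons_per_cluster)
```

To try this on a real probeset annotation file, one would replace the 
text = toy_csv argument with the path to the unzipped CSV file.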

I hope this history helps you to understand the difference between 
these two annotation files.

Best regards,
Christian

On 3/6/13 4:17 PM, Naxerova, Kamila wrote:
> Hi Jim,
>
> thank you for your helpful reply. I have a few follow-up questions.
>>
>> I should throw in my obligatory cautionary statement about summarizing
>> Gene ST data at the probeset (as compared to the transcript) level. If
>> you look at the number of probes/probeset, there are a huge number with
>> < 4 probes. So hypothetically you can do this, but I wouldn't.
>
> I am a bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probeset. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me; without your direct link I would never have found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.
>
> Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?
>
> Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, I essentially commit to analyzing my data on the transcript level?
>>
>> library(AnnotationForge)
>> library(mouse.db0)
>> library(org.Mm.eg.db)
>> makeDBPackage("MOUSECHIP_DB",
>> affy=TRUE,
>> prefix="mogene20sttranscriptcluster",
>> fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
>> outputDir = ".",
>> version="2.11.1",
>> manufacturer = "Affymetrix",
>> chipName = "Mouse Gene 2.0 ST Array",
>> manufacturerUrl = "http://www.affymetrix.com",
>> author = "Kamila Naxerova",
>> maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
>>
>>
>
> Any thoughts on this error message?
>
>> makeDBPackage("MOUSECHIP_DB",
> + affy=TRUE,
> + prefix="mogene20sttranscriptcluster",
> + fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
> + outputDir = ".",
> + version="2.11.1",
> + manufacturer = "Affymetrix",
> + chipName = "Mouse Gene 2.0 ST Array",
> + manufacturerUrl = "http://www.affymetrix.com",
> + author = "Kamila Naxerova",
> + maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
> Error in `[.data.frame`(csvFile, , GenBankIDName) :
>    undefined columns selected
>
>
>> sessionInfo()
> R version 2.15.3 (2013-03-01)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] org.Mm.eg.db_2.8.0    mouse.db0_2.8.0       AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0    RSQLite_0.11.2        DBI_0.2-5             AnnotationDbi_1.20.5  Biobase_2.18.0
>   [9] BiocGenerics_0.4.0    BiocInstaller_1.8.3
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.16.6  parallel_2.15.3 stats4_2.15.3   tools_2.15.3
>
>
>
> Many thanks!
> Kamila