[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays
cstrato at aon.at
Wed Mar 6 19:26:37 CET 2013
Here is some history, as far as I remember:
Originally (1999) Affymetrix sold the Hu6800 (HuGeneFL) array, which was
an ivt array with most probes (oligos) located on the 3'-end. At that
time the collection of probes which represented one gene was called a
probeset, and there was (and still is) only one annotation file for the
ivt arrays. In addition there was one library file, the Hu6800.CDF file.
Most ivt arrays still have CDF-files which map the probes to the (x,y)
location on the arrays.
With the introduction of the HuExon 1.0 ST array, the first exon array,
Affymetrix changed a couple of things:
- they replaced the one CDF-file with two files, called CLF-file and
PGF-file
- in addition they now provide two annotation files, a
transcript-cluster annotation file and a probeset annotation file. A
certain gene (transcript-cluster) typically consists of one or more
exons, one exon consists of one or more probesets, and one probeset
usually consists of 2-4 probes (oligos).
The probeset annotation file now lists all probesets with their
"probeset_id", as well as the "transcript_cluster_id" and the "exon_id"
for each probeset. In contrast, the transcript annotation file lists
only the genes with their "transcript_cluster_id".
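To make the hierarchy concrete, here is a minimal R sketch. It assumes a
tiny made-up table that has only the three ID columns named above; the
IDs are invented, and a real probeset annotation file has many more
columns (and would be read with read.csv instead of typed in). It only
shows how probesets and exons roll up to transcript clusters.

```r
# Toy stand-in for a probeset annotation file. All IDs are invented for
# illustration and do not come from any real Affymetrix file.
probeset_annot <- data.frame(
  probeset_id           = c(101, 102, 103, 104, 105),
  transcript_cluster_id = c(9001, 9001, 9001, 9002, 9002),
  exon_id               = c(501, 501, 502, 503, 504),
  stringsAsFactors      = FALSE
)

# Number of probesets that belong to each transcript cluster (gene).
probesets_per_cluster <- table(probeset_annot$transcript_cluster_id)

# Number of distinct exons per transcript cluster.
exons_per_cluster <- tapply(
  probeset_annot$exon_id,
  probeset_annot$transcript_cluster_id,
  function(x) length(unique(x))
)

# The transcript-level view keeps only one row per transcript cluster.
transcript_annot <- data.frame(
  transcript_cluster_id = sort(unique(probeset_annot$transcript_cluster_id))
)

print(probesets_per_cluster)
print(exons_per_cluster)
print(nrow(transcript_annot))
```

Running the same counts on a real probeset annotation file is one way to
see how many probesets each transcript cluster has on a given array.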
At some later time Affymetrix introduced a cheaper 'exon' array, the
'Whole Genome' HuGene 1.0 ST array since many labs were mainly
interested in the expression of the genes and not of the different
exons. This cheaper array typically has only one probe per exon.
Originally Affymetrix sold this array as an array to measure gene
expression, and there was only one annotation file. Later on they decided
to convert the HuGene array to an exon array, too. Thus all WT arrays
now come with both a probeset annotation file and a transcript
annotation file. In principle you could now use WT arrays to measure the
expression of single exons, however with the disadvantage that usually
there is only one probe per exon.
To understand the distinction between probeset and transcript annotation
files please look at the annotation files, and especially read the
README files which Affymetrix usually provides in the annotation zip-files.
I hope this history helps you to understand the difference between
these two annotation files.
On 3/6/13 4:17 PM, Naxerova, Kamila wrote:
> Hi Jim,
> thank you for your helpful reply. I have a few follow-up questions.
>> I should throw in my obligatory cautionary statement about summarizing
>> Gene ST data at the probeset (as compared to the transcript) level. If
>> you look at the number of probes/probeset, there are a huge number with
>> < 4 probes. So hypothetically you can do this, but I wouldn't.
> I am a bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probe set. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me, without your direct link I would have never found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.
> Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?
> Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, I essentially commit to analyzing my data on the transcript level?
>> outputDir = ".",
>> manufacturer = "Affymetrix",
>> chipName = "Human Gene 2.1 ST Array",
>> manufacturerUrl = "http://www.affymetrix.com",
>> author = "Kamila Naxerova",
>> maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
> Any thoughts on this error message?
> + affy=TRUE,
> + prefix="mogene20sttranscriptcluster",
> + fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
> + outputDir = ".",
> + version="2.11.1",
> + manufacturer = "Affymetrix",
> + chipName = "Mouse Gene 2.0 ST Array",
> + manufacturerUrl = "http://www.affymetrix.com",
> + author = "Kamila Naxerova",
> + maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")
> Error in `[.data.frame`(csvFile, , GenBankIDName) :
> undefined columns selected
> R version 2.15.3 (2013-03-01)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>  en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
>  stats graphics grDevices utils datasets methods base
> other attached packages:
>  org.Mm.eg.db_2.8.0 mouse.db0_2.8.0 AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.5 Biobase_2.18.0
>  BiocGenerics_0.4.0 BiocInstaller_1.8.3
> loaded via a namespace (and not attached):
>  IRanges_1.16.6 parallel_2.15.3 stats4_2.15.3 tools_2.15.3
> Many thanks!