[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Wed Mar 6 19:33:17 CET 2013

Hi Kamila,

On 3/6/2013 10:17 AM, Naxerova, Kamila wrote:
> Hi Jim,
>
> thank you for your helpful reply. I have a few follow-up questions.
>> I should throw in my obligatory cautionary statement about summarizing
>> Gene ST data at the probeset (as compared to the transcript) level. If
>> you look at the number of probes/probeset, there are a huge number with
>> <  4 probes. So hypothetically you can do this, but I wouldn't.
> I am bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probe set. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me, without your direct link I would have never found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.

Maybe I should use a different terminology, or maybe Affy should be more 
consistent. ;-D

Anyway, as you note there are two columns in that file, one called 
transcript cluster and the other is probeset. But note that they are 
identical.

When I say probeset, this is based on the fact that the Gene ST arrays 
are 'cut down' versions of the Exon ST arrays, which in general have 4 
probes per probeset, and each probeset is supposed to interrogate an 
exon (or portion thereof). So in my terminology, the probeset 
corresponds to the original probesets from the Exon arrays, and the 
annotations for that level of summarization can be found in the 
MoGene-2_0-st-v1.na33.mm10.probeset.csv annotation file.

And yeah, Affy should improve their crappy website. But it's way better 
than the Illumina or Agilent sites, so maybe we should count ourselves 
lucky. I found the csv file not by searching on their website, but by 
letting the googles find it for me, by searching on 'mouse gene 2.0 st 
annotation'. I then get this page

http://www.affymetrix.com/estore/browse/products.jsp?productId=131476#1_3

and then what I wanted is under the Technical Documentation tab.

>
> Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?

As I note above, the Gene ST arrays were created by taking the 'good' 
probes from the Exon array. So the notion of a probeset is based on the 
original construction of the probesets on the Exon array, which were 
usually 4 probes. But since Affy only took the good probes, tons of the 
probesets on the Gene ST arrays are made up of 3 or fewer probes.

>
> Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csvfile,  I essentially commit to analyzing my data on the transcript level?
>> library(AnnotationForge)
>> library(mouse.db0)
>> library(org.Mm.eg.db)
>> makeDBPackage("MOUSECHIP_DB",
>> affy=TRUE,
>> prefix="mogene20sttranscriptcluster",
>> fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
>> outputDir = ".",
>> version="2.11.1",
>> manufacturer = "Affymetrix",
>> chipName = "Human Gene 2.1 ST Array",
>> manufacturerUrl = "http://www.affymetrix.com",
>> author = "Kamila Naxerova",
>> maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
>>
>>
> Any thoughts on this error message?

Yeah, I forgot that this isn't a slam dunk like with the 3'-biased 
arrays. Here is the problem in a nutshell:

[jmacdon at adam2 tmp]$ awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}' 
MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,8 | head -n 3
"17210850","---"
"17210852","---"
"17210855","NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 
18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1 
// 18777 /// BC013536 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 
18777 /// BC052848 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 
/// U89352 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// 
CT010201 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 /// 
ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 
/// ENSMUST00000150971 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 
18777 /// ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1 
// 18777 /// ENSMUST00000155020 // Lypla1 // lysophospholipase 1 // 1 
A1|1 // 18777 /// ENSMUST00000141278 // Lypla1 // lysophospholipase 1 // 
1 A1|1 // 18777 /// AK050549 // Lypla1 // lysophospholipase 1 // 1 A1|1 
// 18777 /// AK167231 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 
18777 /// ENSMUST00000115529 // Lypla1 // lysophospholipase 1 // 1 A1|1 
// 18777 /// ENSMUST00000137887 // Lypla1 // lysophospholipase 1 // 1 
A1|1 // 18777 /// AK034851 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 
18777 /// ENSMUST00000131119 // Lypla1 // lysophospholipase 1 // 1 A1|1 
// 18777 /// ENSMUST00000119612 // Lypla1 // lysophospholipase 1 // 1 
A1|1 // 18777"

So these are the data for the first three transcript clusters. The first 
two are mystery clusters with no annotation. The third is Lypla1, but 
you can see that the data are splayed out in Affy's preferred format of 
/// for a transcript with // separating the info for that transcript. 
What we need is something like

17210855   NM_008866

without all that other cruft. I thought I had some code floating around 
that I used to parse these data for the HuGene 2.0 ST array, but I can't 
find it at the moment. But suffice it to say that you want to just have 
the first column, and then one of the RefSeq IDs in the second column. 
There aren't enough rows here to necessitate using something like Perl 
or an Awk script to do the parsing, you could just do it in R. Something 
like

awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}' 
MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,8 > tmp.csv

to get the requisite columns, then you could read in using read.csv() 
and then something like this,

dat <- read.csv("tmp.csv", header=FALSE, stringsAsFactors = FALSE)
dat$fixed <- sapply(sapply(strsplit(dat[1:5,2], " /// "), function(x) 
sapply(strsplit(grep("^N.", x, value=T), " // "), "[", 1)[1]), 
function(z) if(is.null(z)) return(NA) else return(z[1]))

which naively takes just the first RefSeq looking thing. There are 
likely other more sophisticated things that one could do. But if the 
results look OK, then you could just write out columns 1 and 3 and then 
use that as input for building the package.

Best,

Jim

>
>> makeDBPackage("MOUSECHIP_DB",
> + affy=TRUE,
> + prefix="mogene20sttranscriptcluster",
> + fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
> + outputDir = ".",
> + version="2.11.1",
> + manufacturer = "Affymetrix",
> + chipName = "Mouse Gene 2.0 ST Array",
> + manufacturerUrl = "http://www.affymetrix.com",
> + author = "Kamila Naxerova",
> + maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
> Error in `[.data.frame`(csvFile, , GenBankIDName) :
>    undefined columns selected
>
>
>> sessionInfo()
> R version 2.15.3 (2013-03-01)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] org.Mm.eg.db_2.8.0    mouse.db0_2.8.0       AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0    RSQLite_0.11.2        DBI_0.2-5             AnnotationDbi_1.20.5  Biobase_2.18.0
>   [9] BiocGenerics_0.4.0    BiocInstaller_1.8.3
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.16.6  parallel_2.15.3 stats4_2.15.3   tools_2.15.3
>
>
>
> Many thanks!
> Kamila

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099