[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays
James W. MacDonald
jmacdon at uw.edu
Wed Mar 6 19:33:17 CET 2013
Hi Kamila,
On 3/6/2013 10:17 AM, Naxerova, Kamila wrote:
> Hi Jim,
>
> thank you for your helpful reply. I have a few follow-up questions.
>> I should throw in my obligatory cautionary statement about summarizing
>> Gene ST data at the probeset (as compared to the transcript) level. If
>> you look at the number of probes/probeset, there are a huge number with
>> < 4 probes. So hypothetically you can do this, but I wouldn't.
> I am bit confused about transcript clusters and probesets. In the MoGene-2_0-st-v1.na33.mm10.transcript.csv file, each transcript cluster corresponds to exactly one probe set. But from your email it sounds like there are more probesets than transcript clusters -- I assume these are stored in a different file? Unfortunately the structure of the Affymetrix web site is a mystery to me, without your direct link I would have never found the transcript annotation file, so I have no way of browsing and checking out other annotation files to better understand what is going on.
Maybe I should use a different terminology, or maybe Affy should be more
consistent. ;-D
Anyway, as you note there are two columns in that file, one called
transcript cluster and the other is probeset. But note that they are
identical.
When I say probeset, this is based on the fact that the Gene ST arrays
are 'cut down' versions of the Exon ST arrays, which in general have 4
probes per probeset, and each probeset is supposed to interrogate an
exon (or portion thereof). So in my terminology, the probeset
corresponds to the original probesets from the Exon arrays, and the
annotations for that level of summarization can be found in the
MoGene-2_0-st-v1.na33.mm10.probeset.csv annotation file.
And yeah, Affy should improve their crappy website. But it's way better
than the Illumina or Agilent sites, so maybe we should count ourselves
lucky. I found the csv file not by searching on their website, but by
letting the googles find it for me, by searching on 'mouse gene 2.0 st
annotation'. I then get this page
http://www.affymetrix.com/estore/browse/products.jsp?productId=131476#1_3
and then what I wanted is under the Technical Documentation tab.
>
> Why is there a distinction between transcript cluster and probeset in the first place? I understand that it's useful to be able to group probes dynamically (based on our state of knowledge about a locus). If this grouping is defined as the transcript cluster, what is the definition of a probeset?
As I note above, the Gene ST arrays were created by taking the 'good'
probes from the Exon array. So the notion of a probeset is based on the
original construction of the probesets on the Exon array, which were
usually 4 probes. But since Affy only took the good probes, tons of the
probesets on the Gene ST arrays are made up of 3 or fewer probes.
>
> Do I assume correctly that if I build my annotation using the MoGene-2_0-st-v1.na33.mm10.transcript.csvfile, I essentially commit to analyzing my data on the transcript level?
>> library(AnnotationForge)
>> library(mouse.db0)
>> library(org.Mm.eg.db)
>> makeDBPackage("MOUSECHIP_DB",
>> affy=TRUE,
>> prefix="mogene20sttranscriptcluster",
>> fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
>> outputDir = ".",
>> version="2.11.1",
>> manufacturer = "Affymetrix",
>> chipName = "Human Gene 2.1 ST Array",
>> manufacturerUrl = "http://www.affymetrix.com",
>> author = "Kamila Naxerova",
>> maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
>>
>>
> Any thoughts on this error message?
Yeah, I forgot that this isn't a slam dunk like with the 3'-biased
arrays. Here is the problem in a nutshell:
[jmacdon at adam2 tmp]$ awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}'
MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,8 | head -n 3
"17210850","---"
"17210852","---"
"17210855","NM_008866 // Lypla1 // lysophospholipase 1 // 1 A1|1 //
18777 /// ENSMUST00000027036 // Lypla1 // lysophospholipase 1 // 1 A1|1
// 18777 /// BC013536 // Lypla1 // lysophospholipase 1 // 1 A1|1 //
18777 /// BC052848 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777
/// U89352 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 ///
CT010201 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777 ///
ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1 // 18777
/// ENSMUST00000150971 // Lypla1 // lysophospholipase 1 // 1 A1|1 //
18777 /// ENSMUST00000134384 // Lypla1 // lysophospholipase 1 // 1 A1|1
// 18777 /// ENSMUST00000155020 // Lypla1 // lysophospholipase 1 // 1
A1|1 // 18777 /// ENSMUST00000141278 // Lypla1 // lysophospholipase 1 //
1 A1|1 // 18777 /// AK050549 // Lypla1 // lysophospholipase 1 // 1 A1|1
// 18777 /// AK167231 // Lypla1 // lysophospholipase 1 // 1 A1|1 //
18777 /// ENSMUST00000115529 // Lypla1 // lysophospholipase 1 // 1 A1|1
// 18777 /// ENSMUST00000137887 // Lypla1 // lysophospholipase 1 // 1
A1|1 // 18777 /// AK034851 // Lypla1 // lysophospholipase 1 // 1 A1|1 //
18777 /// ENSMUST00000131119 // Lypla1 // lysophospholipase 1 // 1 A1|1
// 18777 /// ENSMUST00000119612 // Lypla1 // lysophospholipase 1 // 1
A1|1 // 18777"
So these are the data for the first three transcript clusters. The first
two are mystery clusters with no annotation. The third is Lypla1, but
you can see that the data are splayed out in Affy's preferred format of
/// for a transcript with // separating the info for that transcript.
What we need is something like
17210855 NM_008866
without all that other cruft. I thought I had some code floating around
that I used to parse these data for the HuGene 2.0 ST array, but I can't
find it at the moment. But suffice it to say that you want to just have
the first column, and then one of the RefSeq IDs in the second column.
There aren't enough rows here to necessitate using something like Perl
or an Awk script to do the parsing, you could just do it in R. Something
like
awk -F, '{if($1 !~ /#|[:alpha:]/) print $0}'
MoGene-2_0-st-v1.na33.mm10.transcript.csv | cut -d, -f 1,8 > tmp.csv
to get the requisite columns, then you could read in using read.csv()
and then something like this,
dat <- read.csv("tmp.csv", header=FALSE, stringsAsFactors = FALSE)
dat$fixed <- sapply(sapply(strsplit(dat[1:5,2], " /// "), function(x)
sapply(strsplit(grep("^N.", x, value=T), " // "), "[", 1)[1]),
function(z) if(is.null(z)) return(NA) else return(z[1]))
which naively takes just the first RefSeq looking thing. There are
likely other more sophisticated things that one could do. But if the
results look OK, then you could just write out columns 1 and 3 and then
use that as input for building the package.
Best,
Jim
>
>> makeDBPackage("MOUSECHIP_DB",
> + affy=TRUE,
> + prefix="mogene20sttranscriptcluster",
> + fileName="MoGene-2_0-st-v1.na33.mm10.transcript.csv",
> + outputDir = ".",
> + version="2.11.1",
> + manufacturer = "Affymetrix",
> + chipName = "Mouse Gene 2.0 ST Array",
> + manufacturerUrl = "http://www.affymetrix.com",
> + author = "Kamila Naxerova",
> + maintainer = "Kamila Naxerova<naxerova at fas.harvard.edu>")
> Error in `[.data.frame`(csvFile, , GenBankIDName) :
> undefined columns selected
>
>
>> sessionInfo()
> R version 2.15.3 (2013-03-01)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] org.Mm.eg.db_2.8.0 mouse.db0_2.8.0 AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0 RSQLite_0.11.2 DBI_0.2-5 AnnotationDbi_1.20.5 Biobase_2.18.0
> [9] BiocGenerics_0.4.0 BiocInstaller_1.8.3
>
> loaded via a namespace (and not attached):
> [1] IRanges_1.16.6 parallel_2.15.3 stats4_2.15.3 tools_2.15.3
>
>
>
> Many thanks!
> Kamila
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list