[BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Hooiveld, Guido Guido.Hooiveld at wur.nl
Wed Mar 6 16:29:23 CET 2013

Hi James and others,
Since I will soon get my first set of Human Gene ST v2.0 arrays, your remark about creating a BioC-compatible annotation library for such new arrays was of particular interest for me. I therefore tried to generate such library based on the code you suggested. However, it doesn't work (see below): the CSV annotation file downloaded from Affymetrix has no column labeled "Representative Public ID", as is expected by the function makeBaseMaps. Note that the NetAffx annotation csv file for e.g. the (old) human HGU133plus2 arrays does indeed contain such column. Moreover, I also noticed that for the Gene ST arrays target sequences have been derived from multiple databases, including both NCBI and ENSEMBL. As a consequence the NetAffx annotation file for the Gene ST arrays contain in the column "gene_assignment" or "mrna_assignment" multiple ID types. Would the library 'annotationForge' be able to handle this at all?


BTW: in case it is relevant: the NetAffx annotation file for the 3'-IVT array (HGU133plus2) was build according to " #%netaffx-annotation-tabular-format-version=1.0", whereas the file for the HuGene ST v2.0 was build according to " #%netaffx-annotation-tabular-format-version=1.1"

> makeDBPackage("HUMANCHIP_DB",
+ affy=TRUE,
+ prefix="hugene20sttranscriptcluster",
+ fileName="HuGene-2_0-st-v1.na33.hg19.probeset.csv",
+ outputDir = ".",
+ version="2.11.1",
+ manufacturer = "Affymetrix",
+ chipName = "Human Gene 2.0 ST Array",
+ manufacturerUrl = "http://www.affymetrix.com", author = "N.N.", maintainer = "me at mail.com")
        Error in `[.data.frame`(csvFile, , GenBankIDName) : 
  undefined columns selected
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-unknown-linux-gnu (64-bit)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] human.db0_2.8.0       AnnotationForge_1.0.3 org.Hs.eg.db_2.8.0   
[4] RSQLite_0.11.2        DBI_0.2-5             AnnotationDbi_1.20.3 
[7] Biobase_2.18.0        BiocGenerics_0.4.0   

loaded via a namespace (and not attached):
[1] IRanges_1.16.6  parallel_2.15.2 stats4_2.15.2  

-----Original Message-----
From: bioconductor-bounces at r-project.org [mailto:bioconductor-bounces at r-project.org] On Behalf Of James W. MacDonald
Sent: Tuesday, March 05, 2013 23:23
To: Naxerova, Kamila
Cc: bioconductor at r-project.org
Subject: Re: [BioC] Analysis of Affymetrix Mouse Gene 2.0 ST arrays

Hi Kamila,

On 3/5/2013 4:45 PM, Naxerova, Kamila wrote:
> Dear all,
> I am analyzing a set of Affymetrix Mouse Gene 2.0 ST arrays. I am quite familiar with 3'-biased chips, but this is my first time looking at data from WT arrays. I have a few general questions -- any advice would be appreciated to speed up my learning process.
> 1) I have already read on this mailing list that the good old affy package does not work well with WT arrays (can anybody point me to any literature on why that is?). So I have installed the oligo and xps packages -- what are the advantages/disadvantages for each? Any opinions on which one is the right "starter kit"?

The affy package was never intended to work with these arrays - it was designed specifically for the 3' biased arrays, which had pre-defined probesets, and which didn't share probes between probesets. In addition, the makecdfenv package is designed to work with the old style CDF packages, and Affy has never released a CDF for these new chips that they are willing to support in any meaningful way.

There were some changes made to the affy package in order to accommodate the fact that probes could be shared between probesets, and it is possible to use functions in affxparser to re-create conventional CDF packages using the newer pgf and clf files. So hypothetically you could still use the affy package (and hypothetically you could still use an Apple IIe for all your computing needs, but that's crazy, so let's move on).

I don't think you will find much difference between oligo and xps, other than the fact that xps requires the additional installation of ROOT. You might play around with both and see which suits you better.

I should throw in my obligatory cautionary statement about summarizing Gene ST data at the probeset (as compared to the transcript) level. If you look at the number of probes/probeset, there are a huge number with < 4 probes. So hypothetically you can do this, but I wouldn't.

> 2) I see with some dread that there seems to be no annotation package for the 2.0 array yet. I have never built my own... any quick bullet points on how I would go about doing that for a WT array?

No dread should be required. All you need to do is get the transcript-level annotation file from Affy
and then the AnnotationForge, mouse.db0, and org.Mm.eg.db packages. Then something like

outputDir = ".",
manufacturer = "Affymetrix",
chipName = "Human Gene 2.1 ST Array",
manufacturerUrl = "http://www.affymetrix.com", author = "Kamila Naxerova", maintainer = "Kamila Naxerova <naxerova at fas.harvard.edu>")

should do the trick. You can then install directly from within R by

install.packages("mogene20sttranscriptcluster.db", repos=NULL,

And see

> 3) It seems that RMA is also used for normalization of WT arrays, so that part I am comfortable with. But are there any differences in preprocessing between 3' and WT arrays that I should watch out for?

Not really. I don't use xps, so cannot say for certain how you do things 
with that package, but with oligo it's a simple

abatch <- read.celfiles(list.celfiles())
eset <- rma(abatch)

To normalize and summarize at the transcript level. Note however that 
the annotation for the resulting ExpressionSet will be the 
pd.mogene.2.0.st.v1 package, and if you use annotation(eset) in any 
further calls to do gene annotation, it won't work out. You need to 
first do

annotation(eset) <- "mogene20sttranscriptcluster.db"

One further note: the intronic controls (especially) have an irritating 
habit of popping up in lists of differentially expressed genes. This is 
IMO likely due to mRNA that has not been fully processed to excise the 
introns, but regardless, these probesets tend to have no annotation at 
all, so are not useful without extra work to figure out what they are 
supposed to be measuring. My usual MO is to just summarily excise them 
after e.g., the eBayes() step of an analysis using limma. If you are 
interested, there is a function in the affycoretools package called 
getMainProbes() that will do this for you.



> Thanks so much!
> Kamila
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

James W. MacDonald, M.S.
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099

Bioconductor mailing list
Bioconductor at r-project.org
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

More information about the Bioconductor mailing list