[Bioc-devel] [BioC] package for human ST gene arrays?

Sun Oct 28 01:09:39 CEST 2007

Moving my response to BioC-devel because the below is a bit more
developer/advanced user oriented (ie I don't want beginners trying to
play with my example code below thinking I am doing things in a
acceptable way).

This response deals with comments made about HuGene support using the
standard suite of makecdfenv/affy and various things that build on top
of that. For future indexing purposes (and trying to follow this tangled
web using the archives of the various mailing lists) some past comments
about this issue are here:

https://stat.ethz.ch/pipermail/bioconductor/2007-June/017740.html
https://stat.ethz.ch/pipermail/bioconductor/2007-October/019794.html

Ok, now that is out of the way. One issue that has been pointed out is
that there is an implicit assumption that there is a one to one mapping
between probes and probesets if you use makecdfenv. This is not
completely true, in particular if you build your cdfenv from a binary
xda format CDF file you will get the multiple mapping. It is only parse
from the text format that you do not get as you say "missing probes".

I leave it as an exercise to the reader as to why Affymetrix chooses to
still ship the CDF files for most expression-like arrays in the text
format rather than the binary/xda format which is much quicker to parse
and pull information from quickly if you just want to read parts of it
at a time. Also I will note that the CDF files for the HuGene and Exon
arrays are explicitly noted as being unsupported both on the Affymetrix
website and in a text file in the zip file that you download to get it.

Also, it is an important point to consider whether it is sensible to
have multiple mappings without having analysis methods that have
awareness of this "feature". Just as others have mentioned that xps and
aroma.affymetrix can be made to work just fine with the multiple mapping
it is true that the affy standard tools can also be brought to bear on
these arrays (including multiple mapping). An example below demonstrates
this. Note that multiple mapping situations become more troublesome for
things where processed information is linked back to probe-location
information (think QC chip image plots using affyPLM).

But before I get to that some questions I put out for comment:
1) Should this divergence in produced cdfenv between text and xda format
CDF files be fixed? (my feeling is yes)
2) Do we want to be including repeated probe in multiple probesets (or
should they be excluded altogether).
3) If we do include mutiple mapping probes should there be some way of
accessing this information so that analysis methods could be made to be
aware of these probes?

Best,

Ben

My example code (I'm fixing the two issues mentioned in the comments).

###
### First use affxparser to convert the text CDF file to the
### binary xda CDF format
###
### Make sure to test that they are equivalent
###

> library(affxparser)
> convertCdf("HuGene-1_0-st-v1.r3.cdf","HuGene-1_0-st-v1.r3.xda.cdf")
> compareCdfs("HuGene-1_0-st-v1.r3.cdf","HuGene-1_0-st-v1.r3.xda.cdf")
[1] TRUE

###
### Now making CDF enviroment packages using
### 1. the text CDF file
### 2. the binary CDF file
###

> library(makecdfenv)
> make.cdf.package("HuGene-1_0-st-v1.r3.xda.cdf",species="Homo_sapiens")
Creating package in /tmp/hugene10stv1.r3.xdacdf 

README PLEASE:
A source package has now been produced in
/tmp/hugene10stv1.r3.xdacdf.
Before using this package it must be installed via 'R CMD INSTALL'
at a terminal prompt (or DOS command shell).
If you are using Windows, you will need to get set up to install
packages.
See the 'R Installation and Administration' manual, specifically
Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'
for more information.

Alternatively, you could use make.cdf.env(), which will not require you
to install a package.
However, this environment will only persist for the current R session
unless you save() it.

> make.cdf.package("HuGene-1_0-st-v1.r3.cdf",species="Homo_sapiens")

Reading CDF file.
Creating CDF environment
Wait for about 323
dots................................................................................................................................................................................................................................................................................................................................................
Creating package in /tmp/hugene10stv1.r3cdf 

README PLEASE:
A source package has now been produced in
/tmp/hugene10stv1.r3cdf.
Before using this package it must be installed via 'R CMD INSTALL'
at a terminal prompt (or DOS command shell).
If you are using Windows, you will need to get set up to install
packages.
See the 'R Installation and Administration' manual, specifically
Section 6 'Add-on Packages' as well as 'Appendix E: The Windows Toolset'
for more information.

Alternatively, you could use make.cdf.env(), which will not require you
to install a package.
However, this environment will only persist for the current R session
unless you save() it.

###
### Now loading in the created environments
###

> load("/tmp/hugene10stv1.r3.xdacdf/data/hugene10stv1.r3.xdacdf.rda")
> load("/tmp/hugene10stv1.r3cdf/data/hugene10stv1.r3cdf.rda")

###
### Checking that dimensions are the same
###

> as.list(hugene10stv1.r3dim)
$NROW
[1] 1050

$NCOL
[1] 1050

> as.list(hugene10stv1.r3.xdadim)
$NROW
[1] 1050

$NCOL
[1] 1050

### 
### See if the same probesets are recorded in each environment
###

> order.xda <-  order(names(as.list(hugene10stv1.r3.xdacdf)))
> order.text <- order(names(as.list(hugene10stv1.r3cdf)))

> length(order.xda)
[1] 33252
> length(order.text)
[1] 32321

###
### So we see that fewer probesets made it through via the text cdf file
### next lets check how many probes are in the probesets that are in
both
###

> which.are.missing <- !
is.element( names(as.list(hugene10stv1.r3.xdacdf))[order.xda],names(as.list(hugene10stv1.r3cdf))[order.text])
> sum(which.are.missing)
[1] 931

> non.missing.xda <- do.call("rbind",
as.list(hugene10stv1.r3.xdacdf)[order.xda][!which.are.missing])
> non.missing.text <- do.call("rbind",
as.list(hugene10stv1.r3cdf)[order.text])
> 
> dim(non.missing.xda)
[1] 817537      2
> dim(non.missing.text)
[1] 788012      2

###
### Check how many in the XDA are unique
###
> length(unique(non.missing.xda[,1]))
[1] 788012

###
### Make sure that these are all in the text environment
###

> sum(is.element(unique(non.missing.xda[,1]),non.missing.text))
[1] 788012

###
### Check that all the probes that were part of the probesets that
completely disappeared
### are in fact in other probesets
###

> missing.xda <- do.call("rbind",
as.list(hugene10stv1.r3.xdacdf)[order.xda][which.are.missing])
> dim(missing.xda)
[1] 27013     2

> length(unique(missing.xda[,1]))
[1] 17413
> sum(is.element(unique(missing.xda[,1]), non.missing.xda))
[1] 17413

### 
### just for comparison purposes choose a random probeset and
### check it agrees
###

> as.list(hugene10stv1.r3.xdacdf)[order.xda][!
which.are.missing][[10001]]
           pm            mm
 [1,]  966644 3.871160e-310
 [2,]  268287 1.697597e-313
 [3,]  793133 3.654716e-310
 [4,] 1051493 1.909796e-313
 [5,] 1014750 3.548828e-310
 [6,]  768945 2.121996e-313
 [7,]  616036 3.871159e-310
 [8,]  266233 2.334195e-313
 [9,] 1045427 3.871157e-310
[10,]   82993 2.546395e-313
[11,]  529925 4.576934e-310
[12,]  301857 2.758595e-313
[13,]  536315 3.654715e-310
[14,] 1035910 2.970794e-313
[15,]  225705 1.075218e-307
[16,]   55445 3.182994e-313
[17,] 1011662 3.548828e-310
[18,]  622627 3.395193e-313
[19,]  707461 3.654716e-310
[20,] 1000636 3.607393e-313
[21,]  803972 4.576936e-310
[22,]  195338 3.819592e-313
[23,]  746710 3.871160e-310
[24,]  593069 4.031792e-313

>  as.list(hugene10stv1.r3cdf)[order.text][[10001]]
           pm mm
 [1,]  966644 NA
 [2,]  268287 NA
 [3,]  793133 NA
 [4,] 1051493 NA
 [5,] 1014750 NA
 [6,]  768945 NA
 [7,]  616036 NA
 [8,]  266233 NA
 [9,] 1045427 NA
[10,]   82993 NA
[11,]  529925 NA
[12,]  301857 NA
[13,]  536315 NA
[14,] 1035910 NA
[15,]  225705 NA
[16,]   55445 NA
[17,] 1011662 NA
[18,]  622627 NA
[19,]  707461 NA
[20,] 1000636 NA
[21,]  803972 NA
[22,]  195338 NA
[23,]  746710 NA
[24,]  593069 NA

> sessionInfo()
R version 2.7.0 Under development (unstable) (2007-10-21 r43232) 
x86_64-unknown-linux-gnu 

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets
methods  
[8] base     

other attached packages:
[1] makecdfenv_1.17.0    affy_1.17.0          preprocessCore_1.1.2
[4] affyio_1.7.1         Biobase_1.17.4       affxparser_1.7.5    

loaded via a namespace (and not attached):
[1] rcompgen_0.1-17

###
###
### Now lets load some HuGene Data
### using the two different cdfenv 
### the try to rma()
###

> setwd("/tmp/HuGene")

> HuGene.Data.text <-
ReadAffy(filenames=list.celfiles(),cdfname="hugene10stv1.r3cdf")
> 
> 
> eset.text <- rma(HuGene.Data.text)
Background correcting
Normalizing
Calculating Expression
> 
> HuGene.Data.xda <-
ReadAffy(filenames=list.celfiles(),cdfname="hugene10stv1.r3.xdacdf")
> z
> eset.xda <- rma(HuGene.Data.xda)
Error in .local(object, ...) : subscript out of bounds
In addition: Warning message:
In .local(object, ...) : NAs introduced by coercion

###
### this doesn't look good. But where is it actually dieing??
###
### it is actually this line here
###
###  exprs <- .Call("rma_c_complete", pm(object, subset), mm(object, 
        subset), probeNames(object, subset), ngenes, body(bg.dens), 
        new.env(), normalize, background, bgversion, verbose, 
        PACKAGE = "affy")
###
### and specifically the mm() call
###
> mm(HuGene.Data.xda)[1:10,]
Error in .local(object, ...) : subscript out of bounds
In addition: Warning message:
In .local(object, ...) : NAs introduced by coercion

###
### The problem is because the xda CDF parser above didn't correctly set
### the mapping for the MM to NA like it should have
###
### Two proposed fixes
### 1) Fix how the CDFenv is created for XDA format files 
### 2) remove the mm() from what is passed down to the "C" code for rma.
It is really
###    something legacy and no longer used.
###

### real quick fix (for this session only)

> mm <- pm
> eset.xda <- rma(HuGene.Data.xda)
Background correcting
Normalizing
Calculating Expression

###
### Check that each has the appropriate number of probesets
### (number of rows)

>  dim(exprs(eset.xda))
[1] 33252    33
>  dim(exprs(eset.text))
[1] 32321    33

###
### what about the number of PM probes?
###

> dim(pm(HuGene.Data.xda))
[1] 844550     33
> dim(pm(HuGene.Data.text))
[1] 788012     33

###
###
### Now lets do something where positional information matters more
###  (ie QC with affyPLM)
###

> library(affyPLM)

> Pset.text <- fitPLM(HuGene.Data.text)
> Pset.xda <- fitPLM(HuGene.Data.xda)

> image(Pset.text,which=15)
> image(Pset.xda,which=15)
Warning messages:
1: In weightmatrix[xycoor] <- x at weights[[1]][, i] :
  number of items to replace is not a multiple of replacement length
2: In weightmatrix[xycoor2] <- x at weights[[1]][, i] :
  number of items to replace is not a multiple of replacement length

### it fails, because multiple mappings of residuals to locations

> dim(weights(Pset.text)$PM)
[1] 788012     33
> dim(weights(Pset.xda)$PM)
[1] 844550     33

On Fri, 2007-10-26 at 10:05 -0400, James W. MacDonald wrote:
> Hi Yongde Bao,
> 
> There is a new package xps in the Devel repository that is intended for 
> the analysis of the gene arrays. To use this package you will need 
> R-devel, as well as installing ROOT. There are clear instructions for 
> building the package on *NIX as well as MacOSX, but if you want to run 
> on Windows you will have to figure things out for yourself.
> 
> In addition, someone has already posted links to cdf packages for this 
> chip that will allow you to use the affy package. Note however that some 
> of the probes on this chip are shared between probesets, and that was 
> not true in the past, so for a certain percentage of probesets you will 
> be missing probes (once a probe is allocated to a probeset, it will not 
> be available for any other probesets).
> 
> Best,
> 
> Jim
> 
> 
> 
> 
> Yongde Bao wrote:
> > Hi all,
> > 
> > I wonder if there is any package in Bioconductor that deals with the analysis 
> > of the Affy Human ST gene arrays. If so can someone point out where to get it?
> > 
> > Best regards,
> > 
> > Yongde Bao
> > University of Virginia
> > 
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>