[Bioc-devel] Bug in rma: zero variance of probeset summaries across arrays

Tue Feb 3 20:14:57 CET 2009

Thank you, Holger, Jim, Ben,

(1) this is indeed a peculiar feature of the medianpolish part of RMA. I 
apologize to Ben for using the harsh word "bug" inappropriately in this 
context. I was misled by observations that using different normalisation 
or background correction methods seemed to "fix" the problem, and had 
forgotten about those previous threads. Sorry.

I wonder whether anyone has done research on how to improve on 
medianpolish for small datasets and why people haven't adapted it.

(2) One further potential problem remains, and I hope that bringing this 
up might help to make sure that rma (which is obviously a core component 
of Bioconductor) is robust even under the most adverse circumstances 
(e.g. myself :). With the script below, I get a core dump. This is on a 
freshly checked out and built R (from today) and all packages then 
installed from scratch (sessionInfo as in previous mail below). It does 
not seem to happen on R-2.8.1. and Bioc release 2.3., and even on R-dev, 
whether it happens seems to depend on the history of the R session. One 
hypothesis is that this could be related to the "destructive=TRUE" 
parameter of rma?

Can anyone reproduce this?

  Best wishes
	Wolfgang

library("affy")
library("genefilter")
options(error=recover)

files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL", "PP_T2_2c.CEL")

#celdir = tempdir()
#for(f in files)
#  download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
#                file.path(celdir, f))
celdir="."

a = ReadAffy(filenames=file.path(celdir, files))

## Define a CDF with one probeset per probe
testcdf = new.env(hash=TRUE)
for(i in seq_len(nrow(exprs(a))))
     assign(paste(i), matrix(i, ncol=1, nrow=1), envir=testcdf)
a at cdfName = "testcdf"

x = rma(a)

----------------------

Background correcting
Normalizing
Calculating Expression
Error in `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",  :
   attempt to set colnames on object with less than two dimensions

Enter a frame number, or 0 to exit

1: source("rma2.R")
2: eval.with.vis(ei, envir)
3: eval.with.vis(expr, envir, enclos)
4: rma(a)
5: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL", 
"PP_T2_1.CEL"

Selection:  4
Called from: eval(expr, envir, enclos)
Browse[1]>  ls()
  [1] "*tmp*"       "background"  "bgversion"   "cols"        "destructive"
  [6] "exprs"       "ngenes"      "normalize"   "object"      "pNList"
[11] "rows"        "subset"      "verbose"
Browse[1]> str(exprs)

........ long dump of text ....

496: stop("attempt to set colnames on object with less than two dimensions")
497: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL", 
"PP_T2_1.CEL", "PP_T2_2c.CEL"))
498: rma(a)
499: eval.with.vis(expr, envir, enclos)
500: eval.with.vis(ei, envir)
501: source("rma2.R")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

James W. MacDonald wrote:
> Hi Holger,
> 
> Holger Schwender wrote:
>> Hi Wolfgang,
>>
>> the difference between the two outputs x1 and x2 are due to that in
>> x1 the probe intensities are background-corrected first and then
>> normalized, while in x2 they are first normalized and then
>> background-corrected.
>>
>> The constant values are a feature of median polish: If median polish
>> is applied to four or six samples, then this can lead to constant
>> values for some of the probe sets. This has been mentioned years ago
>> by Jim (I couldn't find his original email on this topic in the
>> mailing list archive).
> 
> Actually this only occurs with a small, odd number of samples, not even. 
> Well, it could happen for a small number of probesets if the n/2 and n/2 
> + 1 ordered values were identical, but it is much more common in the 
> case of 3 or 5 samples.
> 
> Best,
> 
> Jim
> 
> 
>>
>> Best, Holger
>>
>>
>> -------- Original-Nachricht --------
>>> Datum: Tue, 03 Feb 2009 15:38:06 +0100 Von: Wolfgang Huber
>>> <huber at ebi.ac.uk> An: Bioconductor Developers
>>> <bioc-devel at stat.math.ethz.ch> Betreff: [Bioc-devel] Bug in rma:
>>> zero variance of probeset summaries across    arrays
>>
>>> Hi,
>>>
>>> I am getting puzzling results from the rma function in the affy
>>> package, and furthermore they are different depending on whether
>>> quantile normalisation is called within rma or explicitely. The
>>> puzzling part is that some probesets, in a perfectly ordinary
>>> experiment with 4 different arrays, are summarized with exactly
>>> identical values across all arrays (i.e. zero variance); while the
>>> underlying probe level data for these probesets is variable and
>>> sits right in the middle of the dynamic range (i.e. not at any of
>>> the extremes).
>>>
>>> The code: ---------
>>>
>>> library("affy") library("genefilter") library("preprocessCore") 
>>> options(error=recover)
>>>
>>> files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL",
>>> "PP_T2_2c.CEL")
>>>
>>> for(f in files) 
>>> download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f), 
>>> file.path(tempdir(), f))
>>>
>>> a = ReadAffy(filenames=file.path(tempdir(), files))
>>>
>>> x1 = rma(a)
>>>
>>> aq = a exprs(aq) = normalize.quantiles(exprs(aq)) x2 = rma(aq,
>>> normalize=FALSE)
>>>
>>> probeVars = function(x) { r = rowSds(exprs(x)) ps =
>>> names(which(r==0)) cat(sprintf("%s: %d probes with zero
>>> variance.\n", deparse(substitute(x)), length(ps))) return(ps) }
>>>
>>> probeVars(a) ps = probeVars(x1) probeVars(x2)
>>>
>>> print(exprs(a)[yeast2cdf[[ps[1]]][,1],])
>>>
>>> ----------------------------------------------------------------------------- 
>>>
>>>
>>>
>>> The output: ----------- Background correcting Normalizing Calculating 
>>> Expression Background correcting Calculating Expression
>>>
>>>
>>> a: 0 probes with zero variance. x1: 3 probes with zero variance. x2: 
>>> 0 probes with zero variance.
>>>
>>> PP_T1_1.CEL PP_T1_2.CEL PP_T2_1.CEL PP_T2_2c.CEL 35244          185
>>> 236         269          253 198153         226         314
>>> 320          213 48500          375         370         408
>>> 373 109094         125         157         201          175 144686
>>> 107         131         153          100 214110         181
>>> 296         307          234 47302          355         430
>>> 397          282 108656         117         130         150
>>> 118 147528         158         144         159          135 237691
>>> 208         244         253          169 229910          94
>>> 138         147          133
>>>
>>>> sessionInfo()
>>> R version 2.9.0 Under development (unstable) (2009-02-03 r47829) 
>>> x86_64-unknown-linux-gnu
>>>
>>> locale: 
>>> LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C 
>>>
>>>
>>>
>>> attached base packages: [1] stats     graphics  grDevices datasets
>>> utils     methods   base
>>>
>>> other attached packages: [1] yeast2cdf_2.3.0
>>> preprocessCore_1.5.3 genefilter_1.23.1 [4] affy_1.21.7
>>> Biobase_2.3.10       fortunes_1.3-6
>>>
>>> loaded via a namespace (and not attached): [1] affyio_1.11.3
>>> annotate_1.21.3      AnnotationDbi_1.5.14 [4] DBI_0.2-4
>>> RSQLite_0.7-1        splines_2.9.0 [7] survival_2.34-1
>>> tools_2.9.0          xtable_1.5-4
>>>
>>>
>>>
>>> Best wishes Wolfgang
>>>
>>>
>>> PS While trying to dissect this, I also experimented with
>>> constructing special CDF environments (e.g. one that had one
>>> 'probeset' for each probe), and I noted that repeatedly the C code
>>> of rma caused core dumps. Perhaps a better response to unexpected
>>> shapes of CDF environments would be to issue an error message. I
>>> can provide details if interested.
>>>
>>> ---------------------------------------------------- Wolfgang
>>> Huber, EMBL-EBI, http://www.ebi.ac.uk/huber
>>>
>>> _______________________________________________ 
>>> Bioc-devel at stat.math.ethz.ch mailing list 
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>> -- 
>>
>> _______________________________________________ 
>> Bioc-devel at stat.math.ethz.ch mailing list 
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 

-- 
----------------------------------------------------
Wolfgang Huber  EMBL-EBI  http://www.ebi.ac.uk/huber