[Bioc-devel] Bug in rma: zero variance of probeset summaries across arrays

Tue Feb 3 20:08:49 CET 2009

The rma command works fine on my system - fresh svn checkout of R- 
devel from today, all relevant packages updated. I did not load  
genefilter as you do below, but aside from that it is the same.

I have only tried once, and I ran it with R CMD BATCH.

2 comments:
1) It seems to complain about colnames and an object with no  
dimension. Perhaps a drop = FALSE needs to be put in place somewhere  
in the code (but why it works for me, I have no idea about)
2) In general, I would also have put a colnames = "pm" on the probeset  
matrices you make below. Hard to know whether it is a requirement in  
all parts of the code, but seems safer.

Kasper

On Feb 3, 2009, at 11:14 , Wolfgang Huber wrote:

> Thank you, Holger, Jim, Ben,
>
> (1) this is indeed a peculiar feature of the medianpolish part of  
> RMA. I apologize to Ben for using the harsh word "bug"  
> inappropriately in this context. I was misled by observations that  
> using different normalisation or background correction methods  
> seemed to "fix" the problem, and had forgotten about those previous  
> threads. Sorry.
>
> I wonder whether anyone has done research on how to improve on  
> medianpolish for small datasets and why people haven't adapted it.
>
> (2) One further potential problem remains, and I hope that bringing  
> this up might help to make sure that rma (which is obviously a core  
> component of Bioconductor) is robust even under the most adverse  
> circumstances (e.g. myself :). With the script below, I get a core  
> dump. This is on a freshly checked out and built R (from today) and  
> all packages then installed from scratch (sessionInfo as in previous  
> mail below). It does not seem to happen on R-2.8.1. and Bioc release  
> 2.3., and even on R-dev, whether it happens seems to depend on the  
> history of the R session. One hypothesis is that this could be  
> related to the "destructive=TRUE" parameter of rma?
>
> Can anyone reproduce this?
>
> Best wishes
> 	Wolfgang
>
>
> library("affy")
> library("genefilter")
> options(error=recover)
>
> files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL", "PP_T2_2c.CEL")
>
> #celdir = tempdir()
> #for(f in files)
> #  download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
> #                file.path(celdir, f))
> celdir="."
>
> a = ReadAffy(filenames=file.path(celdir, files))
>
> ## Define a CDF with one probeset per probe
> testcdf = new.env(hash=TRUE)
> for(i in seq_len(nrow(exprs(a))))
>    assign(paste(i), matrix(i, ncol=1, nrow=1), envir=testcdf)
> a at cdfName = "testcdf"
>
> x = rma(a)
>
>
> ----------------------
>
> Background correcting
> Normalizing
> Calculating Expression
> Error in `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL",  
> "PP_T1_2.CEL",  :
>  attempt to set colnames on object with less than two dimensions
>
> Enter a frame number, or 0 to exit
>
> 1: source("rma2.R")
> 2: eval.with.vis(ei, envir)
> 3: eval.with.vis(expr, envir, enclos)
> 4: rma(a)
> 5: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",  
> "PP_T2_1.CEL"
>
> Selection:  4
> Called from: eval(expr, envir, enclos)
> Browse[1]>  ls()
> [1] "*tmp*"       "background"  "bgversion"   "cols"         
> "destructive"
> [6] "exprs"       "ngenes"      "normalize"   "object"      "pNList"
> [11] "rows"        "subset"      "verbose"
> Browse[1]> str(exprs)
>
> ........ long dump of text ....
>
>
> 496: stop("attempt to set colnames on object with less than two  
> dimensions")
> 497: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",  
> "PP_T2_1.CEL", "PP_T2_2c.CEL"))
> 498: rma(a)
> 499: eval.with.vis(expr, envir, enclos)
> 500: eval.with.vis(ei, envir)
> 501: source("rma2.R")
>
> Possible actions:
> 1: abort (with core dump, if enabled)
> 2: normal R exit
> 3: exit R without saving workspace
> 4: exit R saving workspace
> Selection:
>
>
> James W. MacDonald wrote:
>> Hi Holger,
>> Holger Schwender wrote:
>>> Hi Wolfgang,
>>>
>>> the difference between the two outputs x1 and x2 are due to that in
>>> x1 the probe intensities are background-corrected first and then
>>> normalized, while in x2 they are first normalized and then
>>> background-corrected.
>>>
>>> The constant values are a feature of median polish: If median polish
>>> is applied to four or six samples, then this can lead to constant
>>> values for some of the probe sets. This has been mentioned years ago
>>> by Jim (I couldn't find his original email on this topic in the
>>> mailing list archive).
>> Actually this only occurs with a small, odd number of samples, not  
>> even. Well, it could happen for a small number of probesets if the  
>> n/2 and n/2 + 1 ordered values were identical, but it is much more  
>> common in the case of 3 or 5 samples.
>> Best,
>> Jim
>>>
>>> Best, Holger
>>>
>>>
>>> -------- Original-Nachricht --------
>>>> Datum: Tue, 03 Feb 2009 15:38:06 +0100 Von: Wolfgang Huber
>>>> <huber at ebi.ac.uk> An: Bioconductor Developers
>>>> <bioc-devel at stat.math.ethz.ch> Betreff: [Bioc-devel] Bug in rma:
>>>> zero variance of probeset summaries across    arrays
>>>
>>>> Hi,
>>>>
>>>> I am getting puzzling results from the rma function in the affy
>>>> package, and furthermore they are different depending on whether
>>>> quantile normalisation is called within rma or explicitely. The
>>>> puzzling part is that some probesets, in a perfectly ordinary
>>>> experiment with 4 different arrays, are summarized with exactly
>>>> identical values across all arrays (i.e. zero variance); while the
>>>> underlying probe level data for these probesets is variable and
>>>> sits right in the middle of the dynamic range (i.e. not at any of
>>>> the extremes).
>>>>
>>>> The code: ---------
>>>>
>>>> library("affy") library("genefilter") library("preprocessCore")  
>>>> options(error=recover)
>>>>
>>>> files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL",
>>>> "PP_T2_2c.CEL")
>>>>
>>>> for(f in files) download.file(file.path("http://www.ebi.ac.uk/~huber/pub 
>>>> ", f), file.path(tempdir(), f))
>>>>
>>>> a = ReadAffy(filenames=file.path(tempdir(), files))
>>>>
>>>> x1 = rma(a)
>>>>
>>>> aq = a exprs(aq) = normalize.quantiles(exprs(aq)) x2 = rma(aq,
>>>> normalize=FALSE)
>>>>
>>>> probeVars = function(x) { r = rowSds(exprs(x)) ps =
>>>> names(which(r==0)) cat(sprintf("%s: %d probes with zero
>>>> variance.\n", deparse(substitute(x)), length(ps))) return(ps) }
>>>>
>>>> probeVars(a) ps = probeVars(x1) probeVars(x2)
>>>>
>>>> print(exprs(a)[yeast2cdf[[ps[1]]][,1],])
>>>>
>>>> -----------------------------------------------------------------------------
>>>>
>>>>
>>>> The output: ----------- Background correcting Normalizing  
>>>> Calculating Expression Background correcting Calculating Expression
>>>>
>>>>
>>>> a: 0 probes with zero variance. x1: 3 probes with zero variance.  
>>>> x2: 0 probes with zero variance.
>>>>
>>>> PP_T1_1.CEL PP_T1_2.CEL PP_T2_1.CEL PP_T2_2c.CEL 35244          185
>>>> 236         269          253 198153         226         314
>>>> 320          213 48500          375         370         408
>>>> 373 109094         125         157         201          175 144686
>>>> 107         131         153          100 214110         181
>>>> 296         307          234 47302          355         430
>>>> 397          282 108656         117         130         150
>>>> 118 147528         158         144         159          135 237691
>>>> 208         244         253          169 229910          94
>>>> 138         147          133
>>>>
>>>>> sessionInfo()
>>>> R version 2.9.0 Under development (unstable) (2009-02-03 r47829)  
>>>> x86_64-unknown-linux-gnu
>>>>
>>>> locale:  
>>>> LC_CTYPE 
>>>> = 
>>>> it_IT 
>>>> .UTF 
>>>> -8 
>>>> ;LC_NUMERIC 
>>>> = 
>>>> C 
>>>> ;LC_TIME 
>>>> = 
>>>> it_IT 
>>>> .UTF 
>>>> -8 
>>>> ;LC_COLLATE 
>>>> = 
>>>> it_IT 
>>>> .UTF 
>>>> -8 
>>>> ;LC_MONETARY 
>>>> = 
>>>> C 
>>>> ;LC_MESSAGES 
>>>> = 
>>>> it_IT 
>>>> .UTF 
>>>> -8 
>>>> ;LC_PAPER 
>>>> = 
>>>> it_IT 
>>>> .UTF 
>>>> -8 
>>>> ;LC_NAME 
>>>> = 
>>>> C 
>>>> ;LC_ADDRESS 
>>>> =C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C
>>>>
>>>>
>>>> attached base packages: [1] stats     graphics  grDevices datasets
>>>> utils     methods   base
>>>>
>>>> other attached packages: [1] yeast2cdf_2.3.0
>>>> preprocessCore_1.5.3 genefilter_1.23.1 [4] affy_1.21.7
>>>> Biobase_2.3.10       fortunes_1.3-6
>>>>
>>>> loaded via a namespace (and not attached): [1] affyio_1.11.3
>>>> annotate_1.21.3      AnnotationDbi_1.5.14 [4] DBI_0.2-4
>>>> RSQLite_0.7-1        splines_2.9.0 [7] survival_2.34-1
>>>> tools_2.9.0          xtable_1.5-4
>>>>
>>>>
>>>>
>>>> Best wishes Wolfgang
>>>>
>>>>
>>>> PS While trying to dissect this, I also experimented with
>>>> constructing special CDF environments (e.g. one that had one
>>>> 'probeset' for each probe), and I noted that repeatedly the C code
>>>> of rma caused core dumps. Perhaps a better response to unexpected
>>>> shapes of CDF environments would be to issue an error message. I
>>>> can provide details if interested.
>>>>
>>>> ---------------------------------------------------- Wolfgang
>>>> Huber, EMBL-EBI, http://www.ebi.ac.uk/huber
>>>>
>>>> _______________________________________________ Bioc-devel at stat.math.ethz.ch 
>>>>  mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>> -- 
>>>
>>> _______________________________________________ Bioc-devel at stat.math.ethz.ch 
>>>  mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
> -- 
> ----------------------------------------------------
> Wolfgang Huber  EMBL-EBI  http://www.ebi.ac.uk/huber
>
> _______________________________________________
> Bioc-devel at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel