[BioC] vsn and scaling

Thu Oct 4 00:16:07 CEST 2007

Dear Roslin,

I have now had a look at your data (which you sent me off-list). The 
quick bottomline for hurried listreaders is: consider the parameter 
"lts.quantile" and try different choices from the default, especially if 
you are looking at quite variable data!

Now in detail. There are two issues:

1. your two datasets not only differ in the order of the arrays 
(columns), but also in the set of features (rows of the data matrix): 
the first one has 3884, the second has 3746. Out of these, 3588 are in 
common,  296 are only in the first and 158 are only in the second.

2. the data appear relatively noisy, and hence the precision of the 
estimated normalization parameters is not great.

For the purpose of comparison, (1) can be addressed by fitting vsn only 
to the common features (you can then apply the fit to all using the 
predict method).

(2) can often be addressed by varying the parameter "lts.quantile". Its 
defafult value 0.9 in vsn2 is probably too optimistic for your data (it 
assumes less variability between arrays), and in the example code below 
I set it to 0.5. With that, the results for R10001 and R10002 are very 
similar (see commonfeatures.png).

The output of vsn2 can depend sensitively on the input (e.g. the order 
of the arrays or the addition/omission of a few rows) BUT if that 
happens, it indicates a problem, either because the vsn model is not 
appropriate for the data or the numerical optimisation algorithm failed 
to find the optimal fit. In such cases, tuning of the parameters (in 
particular, lts.quantile and pstart) can help a lot.

Then, in some cases (and I think this applies to yours), the results can 
seemingly differ, but they could both be within the range of the 
plausible. Think of them as two different estimates for the same 
underlying parameter, which are both consistent with the available data 
and the model. This can happen if the likelihood landscape is very 
shallow / the model is overparameterized. In particular, this will 
happen for vsn2 when the relative size of additive noise is small, 
"glog" is overkill, and "log" would do just as well.

The resulting plots ans script are also here: 
http://www.ebi.ac.uk/~huber/pub/roslin

I hope this helps, please let me know.

Best wishes
	Wolfgang

---------------------------------------------------------------
library("vsn")

x1 = as.matrix(read.table("R10001.vsn.filtered.raw.txt", sep="\t", 
quote="", header=TRUE, row.names=1L))
x2 = as.matrix(read.table("R10002.vsn.filtered.raw.txt", sep="\t", 
quote="", header=TRUE, row.names=1L))

common = intersect(rownames(x1), rownames(x2))
onlyx1 = setdiff(rownames(x1), rownames(x2))
onlyx2 = setdiff(rownames(x2), rownames(x1))

cat(length(common), length(onlyx1), length(onlyx2), "\n")
## 3588 296 158

myPlot = function(a, b, name) {
   png(paste(name, "png", sep="."), width=700, height=500)
   par(mfrow=c(2,3))
   for(cn in colnames(a)){
     plot(a[common, cn], b[common, cn], pch=".", main=cn)
     abline(a=0, b=1, col="blue")
   }
   dev.off()
}

myPlot(log2(x1), log2(x2), "beforevsn")

## with all features
v1 = vsn2(x1, lts.quantile=0.5)
v2 = vsn2(x2, lts.quantile=0.5)
myPlot(exprs(v1), exprs(v2), "withallfeatures")

## with common features only
w1 = vsn2(x1[common,], lts.quantile=0.5)
w2 = vsn2(x2[common,], lts.quantile=0.5)
myPlot(exprs(w1), exprs(w2), "commonfeatures")

-------------------------------------------------------------------
 > sessionInfo()
R version 2.6.0 Patched (2007-10-03 r43075)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] tools     stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] vsn_3.3.1              limma_2.11.14          affy_1.15.12
[4] preprocessCore_0.99.22 affyio_1.5.11          Biobase_1.15.36
[7] fortunes_1.3-3

loaded via a namespace (and not attached):
[1] grid_2.6.0      lattice_0.16-5  rcompgen_0.1-15
-------------------------------------------------------------------

Wolfgang Huber ha scritto:
> Dear Roslin,
> 
>> Am I correct in saying that the default setting of vsn2 is to scale 
>> everything to the first array?
> 
> This is not the case. The model that vsn fits is invariant under 
> permutation of the arrays (you can look at the vignette "Likelihood 
> Calculations for vsn" if you are interested in details).
> 
> However, the fitting does includes an iterative numeric optimisation; if 
> that fails, and runs into a local optimum, such a behaviour can depend 
> sensitively on the data (not just the order of the arrays).
> 
>> Could this be potentially problematic? For example, we have observed 
>> strange results when we change the order of loading of the data into 
>> vsn2. The resulting normalised data looks different when vieved in an 
>> MA plot.
> 
> Can you send me an example that I can reproduce? Such examples would be 
> helpful for me to improve the diagnostics that can be done by vsn2, 
> either automatically, or in a subsequent quality control step.
> 
> Best wishes
>   Wolfgang
> 
> ------------------------------------------------------------------
> Wolfgang Huber  EBI/EMBL  Cambridge UK  http://www.ebi.ac.uk/huber
>