[BioC] predict vsn with reference

Tue Oct 2 10:29:12 CEST 2007

I'm having difficulty using the 'reference' argument of vsn to put data 
from a new microarray onto the scale of an existing set of arrays, when 
all the arrays are normalised using a shared set of controls.

I think I'm not understanding the way offsets are handled: when a 
reference is used, predicted values for the data used to create a vsn 
object differ from the values stored in that vsn object. For example, 
if I have data from 2 arrays in 'a' and want to put array b onto their 
scale, this is what I'm doing:

library(vsn)
set.seed(214)
vals<-runif(1000)
a<-matrix(rep(vals,2)+0.1*rnorm(2000),1000,2)
b<-vals+0.1*rnorm(1000)
aVsn<-vsn2(a)
bVsn<-vsn2(b,reference=aVsn)

The values stored in bVsn are now on the same scale as the 'a' arrays:

plot(exprs(aVsn)[,2],exprs(bVsn)); abline(0,1)

However, the predictions from bVsn using the data b are offset from 
these values:

plot(exprs(bVsn),predict(bVsn,b)); abline(0,1)

This is an issue when these comparable spots are only a reference set of 
probes for a larger array:

aFull<-rbind(a,matrix(runif(20000),10000,2))
bFull<-c(b,runif(10000))

I've been calculating values for the 'a' arrays using:

aFullVal<-predict(aVsn,aFull)

but if I use the same approach for the b array, the results are no 
longer on the same scale as the 'a' arrays:

bFullVal<-predict(bVsn,bFull)

plot(aFullVal[1:1000,1],bFullVal[1:1000,1]); abline(0,1)

I can get back to the scale by adding the mean difference:

offset<-mean(exprs(bVsn)-predict(bVsn,b))
bFullVal2<-bFullVal+offset
plot(aFullVal[1:1000,1],bFullVal2[1:1000,1]); abline(0,1)

But I don't really understand what this offset is or where it comes from 
(particularly in this toy example where the offset is much larger than 
any real difference between a and b, though I guess I haven't put in 
anything that actually needs variance stabilisation).
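
To make the question concrete, here is a toy sketch in base R of the 
situation I suspect. It is not vsn's actual code: the function 'glog', 
its parameters 'a', 'b' and 'shift', and the values used are all 
hypothetical. It assumes the stored and predicted values come from the 
same arsinh-type transformation and differ only by a constant shift, in 
which case the mean difference recovers that shift exactly:

```r
# Toy glog-style transform: arsinh(a + b*y) minus a constant shift.
# 'glog', 'a', 'b' and 'shift' are hypothetical names, not part of vsn.
glog <- function(y, a, b, shift = 0) asinh(a + b * y) - shift

set.seed(214)
y <- runif(1000)

# Stand-ins for exprs(bVsn) and predict(bVsn, b): same parameters,
# but only the "stored" values carry the constant shift.
stored    <- glog(y, a = 0.1, b = 2, shift = 1.5)
predicted <- glog(y, a = 0.1, b = 2)

d <- stored - predicted
range(d)                    # every difference is -1.5 up to rounding

shiftEst  <- mean(d)        # same calculation as my 'offset' above
corrected <- predicted + shiftEst
max(abs(corrected - stored))  # essentially zero
```

If the real discrepancy behaves like this (range(d) very narrow), the 
correction would be exact rather than approximate; if the differences 
vary with intensity, something other than a constant shift is involved.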

So it would be good to know i) whether correcting for the offset in 
this way, whatever it turns out to be, is a reasonable approach 
(especially when b actually comprises several arrays), and ii) whether 
there is any less arbitrary way to calculate values for array b while 
staying on the scale of the 'a' arrays (e.g. using parameter values 
directly).

Any help much appreciated,

Chris

> sessionInfo()
R version 2.5.1 (2007-06-27)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_GB.UTF-8;LC_NUMERIC=C;LC_TIME=en_GB.UTF-8;LC_COLLATE=en_GB.UTF-8;LC_MONETARY=en_GB.UTF-8;LC_MESSAGES=en_GB.UTF-8;LC_PAPER=en_GB.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_GB.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] "tools"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
[7] "methods"   "base"

other attached packages:
     vsn    limma     affy   affyio  Biobase
 "2.2.0" "2.10.5" "1.14.2"  "1.4.1" "1.14.1"

-- 
------------------------------------------------------------------------
Dr Christopher Knight             Manchester Interdisciplinary Biocentre
room 2.001                                  The University of Manchester
Tel:  +44 (0)161 3065138                             131 Princess Street
Fax:  +44 (0)161 3064556                               Manchester M1 7DN
chris.knight at manchester.ac.uk                                         UK 
www.dbkgroup.org/MCISB/people/knight/                   ` · . ,,><(((°>