[BioC] expresso: performing RMA on NON-Affy data?

Fri Apr 24 10:13:12 CEST 2009

dear Mark,

i've followed with interest your example and i found something a bit
counter intuitive to me, let me run your code again with a seed so that
we can all reproduce the numbers:

nsamples <- 10
nprobes <- 5
set.seed(123)
chipEffect <- rnorm(nsamples, mean=6)
names(chipEffect) <- paste("e",1:nsamples,sep="")
probeEffect <- rnorm(nprobes)
Y <- outer(chipEffect, probeEffect, "+")   # probe effect
Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise
dimnames(Y) <- list(paste("e",1:nsamples,sep=""),
paste("p",1:nprobes,sep=""))

Y
          p1       p2       p3       p4       p5
e1  6.842297 5.630669 5.909160 5.437896 5.035330
e2  7.043689 6.213415 6.225986 5.840217 5.059106
e3  8.586128 7.933859 7.953289 7.622725 7.061329
e4  7.364726 6.316509 6.440684 6.259188 5.527053
e5  7.306090 6.614483 6.492012 6.231634 5.595041
e6  8.832364 8.117525 8.046366 7.851080 7.197188
e7  7.663201 6.791223 6.840896 6.568744 5.854843
e8  5.856420 5.184265 5.009171 4.841334 4.145777
e9  6.464340 5.760774 5.930814 5.560690 4.655448
e10 6.715916 5.996310 6.075906 5.642444 4.891318

f <- medpolish(Y)

summaryvalues <- f$overall + f$row
summaryvalues
      e1       e2       e3       e4       e5       e6       e7       e8 
5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 
      e9      e10 
5.760774 5.911843 

chipEffect
      e1       e2       e3       e4       e5       e6       e7       e8 
5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 
      e9      e10 
5.313147 5.554338 

but if we actually calculate the mean squared error (MSE) of the
previous estimates with respect to the true chip effect and compare it
with the MSE of the estimate obtained by simply averaging the
observations across probes:

sum((summaryvalues-chipEffect)^2)
[1] 1.528436
sum((rowMeans(Y)-chipEffect)^2)
[1] 0.9438652

averaging seems to me to work better than median polish, am i missing
something here?

thanks!
robert.

On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote:
> Jose,
> 
> Can I add a (possibly naive) suggestion?
> 
> You say normalization is not the issue, its the summarization of 3-8  
> probes per probeset for your 1-colour Nimblegen data.  So maybe you  
> just want to fit the log-additive RMA linear model to your data.  This  
> is akin to estimating a model with probe effects and chip effects for  
> each probeset ... of which you are really interested in the chip  
> effects.
> 
> Say you were able to collect your data from a single probeset into a  
> matrix Y (concocted example below):
> 
> ce <- rnorm(10,mean=6)   # 10 samples
> pe <- rnorm(5)           # 5 probes in probeset
> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 )  # add  
> noise
> 
> One way to do the fits in a quick and robust way is median polish:
> 
> f <- medpolish(Y)
> 
> ---------------
>  > f
> 
> Median Polish Results (Dataset: "Y")
> 
> Overall: 6.949745
> 
> Row Effects:
>   [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360 -0.3210895
>   [7]  0.5180266 -0.4351636 -1.2578855  0.4802634
> 
> Column Effects:
> [1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01  
> -4.283675e-01
> 
> Residuals:
>               [,1]      [,2]      [,3]       [,4]       [,5]
>   [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
>   [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
>   [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
>   [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
>   [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
>   [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
>   [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
>   [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
>   [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
> [10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157
> 
>  > dim(Y)
> [1] 10  5
>  > f$overall + f$row  # estimated chip effects
>   [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772  
> 6.514582
>   [9] 5.691860 7.430009
>  > ce                 # true chip effects
>   [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525  
> 6.101933
>   [9] 5.117349 6.905161
> ---------------
> 
> quick sketch ... it would be (fairly) easy to split up your table of  
> log-transformed normalized probe intensities into a matrix for each  
> probeset (e.g. using split() ...), robustly fit the model, extract the  
> chip effects and whammo, there is your table of summarized expression  
> values ... this would only be a few lines of R code and probably not  
> too inefficient (?) ... this is effectively what goes on under the  
> hood of affy::rma() and the like (although it uses C code and in a  
> very general way that uses CDF environments).
> 
> I suspect you could use the 'oligo' package to do much the same thing,  
> after using pdInfoBuilder() to correctly assign probes to probesets ...
> 
> ... anyways, just some thoughts.
> 
> Cheers,
> Mark
> 
> 
> 
> 
> 
> 
> 
> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:
> 
> >
> > On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
> >
> >> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
> >>
> >>> Hi Jose,
> >>>
> >>> Do you want to do RMA, or just normalize? The problem with trying to
> >>> wedge things into an AffyBatch is that the affy package will then  
> >>> try
> >>> to find the cdfenv that contains the probe to probeset mappings,  
> >>> so by
> >>> trying to leverage the AffyBatch infrastructure you will have to  
> >>> also
> >>> come up with a fake cdf.
> >>>
> >>> If you don't have probes that make up a probeset, then I'm not  
> >>> sure the
> >>> affy package will be of use here.
> >>>
> >>> Can you give more details?
> >>>
> >>> Best,
> >>>
> >>> Jim
> >>
> >> Hi Jim,
> >>
> >> normalisation is not an issue, it's more to do with the  
> >> summarisation of probesets and something like 'Expresso' seems like  
> >> a good way to do what I need (and some other things I don't need).
> >>
> >> I am dealing with Nimblegen arrays. Both two colour (whole genome  
> >> promoter arrays, with anything up to 20 probes per probeset), and  
> >> one colour "a la Affymetrix" (expression arrays, with anything  
> >> between 3 to 8 probes per probeset).
> >>
> >> I've been dealing with teh two colour stuff just like I used to  
> >> deal with my spotted cDNA arrays, using Limma. To summarise the  
> >> data... I've used several approaches. Mostly I am not interested in  
> >> the whole 2.7kb that each "promoter region" comprises, so I've  
> >> taken subsets blah blah... Anyway, I'm happy with the results there.
> >> But for the expression data, I have one channel data, just like  
> >> Affy data. Numblegen provides already normalised and summarised  
> >> data along with the raw data, and they state they use the RMA  
> >> procedure which I've come across with when readingabout Affy chips,  
> >> although I've never analysed Affy data myself.
> >>
> >> I'm reasonably happy with the data given to me. It looks  
> >> reasonable. So I want to be able to do that myself rather than  
> >> depending on their data (thus allowing me to do things a bit  
> >> differently if I want to), and since the RMA-processed data looks  
> >> good, I am interested in finding a way to do RMA myself.
> >>
> >> You're right, the problem with my trying to make an AffyBatch from  
> >> non Affy data is that I'm going to have to create a cdf-like  
> >> file... and probably will encounter other obstacles... that's why I  
> >> thought I'd ask here, as there's people who are very familiar with  
> >> that structure...
> >>
> >> In my naivety, it seems it should be a simple enough task... and as  
> >> I'm using 4 types of arrays mostly... I'd only have to do some work  
> >> to make these work and then just enjoy the ride as new experiments  
> >> roll in...
> >> Am I naive? ;-)
> >
> > It is pretty simple to do what you want, but "simple" is of course  
> > in the eye of the beholder - it depends on how familiar you are with  
> > the affy structures from a development point of view.
> >
> > I am not familiar with Nimblegen, but that might be much easier, as  
> > in working out of the box.
> >
> > Kasper
> >
> >
> >> I hope I clarified enough what I'm after.
> >>
> >> Jose
> >>
> >>
> >>
> >> -- 
> >> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
> >> The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131  
> >> 6513374
> >> Institute for Cell & Molecular Biology        Fax:   +44 (0)131  
> >> 6507360
> >> Swann Building, Mayfield Road
> >> University of Edinburgh
> >> Edinburgh EH9 3JR
> >> UK
> >>
> >> -- 
> >> The University of Edinburgh is a charitable body, registered in
> >> Scotland, with registration number SC005336.
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> ------------------------------
> Mark Robinson
> Epigenetics Laboratory, Garvan
> Bioinformatics Division, WEHI
> e: m.robinson at garvan.org.au
> e: mrobinson at wehi.edu.au
> p: +61 (0)3 9345 2628
> f: +61 (0)3 9347 0852
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>