[BioC] expresso: performing RMA on NON-Affy data?

Fri Apr 24 14:33:50 CEST 2009

Hi Robert,

Robert Castelo wrote:
> dear Mark,
> 
> i've followed with interest your example and i found something a bit
> counter intuitive to me, let me run your code again with a seed so that
> we can all reproduce the numbers:
> 
> nsamples <- 10
> nprobes <- 5
> set.seed(123)
> chipEffect <- rnorm(nsamples, mean=6)
> names(chipEffect) <- paste("e",1:nsamples,sep="")
> probeEffect <- rnorm(nprobes)
> Y <- outer(chipEffect, probeEffect, "+")   # probe effect
> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise
> dimnames(Y) <- list(paste("e",1:nsamples,sep=""),
> paste("p",1:nprobes,sep=""))
> 
> Y
>           p1       p2       p3       p4       p5
> e1  6.842297 5.630669 5.909160 5.437896 5.035330
> e2  7.043689 6.213415 6.225986 5.840217 5.059106
> e3  8.586128 7.933859 7.953289 7.622725 7.061329
> e4  7.364726 6.316509 6.440684 6.259188 5.527053
> e5  7.306090 6.614483 6.492012 6.231634 5.595041
> e6  8.832364 8.117525 8.046366 7.851080 7.197188
> e7  7.663201 6.791223 6.840896 6.568744 5.854843
> e8  5.856420 5.184265 5.009171 4.841334 4.145777
> e9  6.464340 5.760774 5.930814 5.560690 4.655448
> e10 6.715916 5.996310 6.075906 5.642444 4.891318
> 
> f <- medpolish(Y)
> 
> summaryvalues <- f$overall + f$row
> summaryvalues
>       e1       e2       e3       e4       e5       e6       e7       e8 
> 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 
>       e9      e10 
> 5.760774 5.911843 
> 
> chipEffect
>       e1       e2       e3       e4       e5       e6       e7       e8 
> 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 
>       e9      e10 
> 5.313147 5.554338 
> 
> but if we actually calculate the mean squared error (MSE) of the
> previous estimates with respect to the true chip effect and compare it
> with the MSE of the estimate obtained by simply averaging the
> observations across probes:
> 
> sum((summaryvalues-chipEffect)^2)
> [1] 1.528436
> sum((rowMeans(Y)-chipEffect)^2)
> [1] 0.9438652
> 
> 
> averaging seems to me to work better than median polish, am i missing
> something here?

Yes. You are missing the fact that the data from Affy probes usually are 
not normally distributed. In fact, it is not uncommon for a given 
probeset to have widely divergent intensity levels for its component 
probes. Because of the fact that the mean is not robust to outliers, 
people long ago abandoned methods based on a normal distribution.

So yeah, in your case with noisy normally distributed data, a mean 
decomposition is more powerful than a median decomposition (it is UMP 
for normally distributed data after all), but in practice it will be 
pretty ugly.

Best,

Jim

> 
> thanks!
> robert.
> 
> 
> 
> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote:
>> Jose,
>>
>> Can I add a (possibly naive) suggestion?
>>
>> You say normalization is not the issue, its the summarization of 3-8  
>> probes per probeset for your 1-colour Nimblegen data.  So maybe you  
>> just want to fit the log-additive RMA linear model to your data.  This  
>> is akin to estimating a model with probe effects and chip effects for  
>> each probeset ... of which you are really interested in the chip  
>> effects.
>>
>> Say you were able to collect your data from a single probeset into a  
>> matrix Y (concocted example below):
>>
>> ce <- rnorm(10,mean=6)   # 10 samples
>> pe <- rnorm(5)           # 5 probes in probeset
>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 )  # add  
>> noise
>>
>> One way to do the fits in a quick and robust way is median polish:
>>
>> f <- medpolish(Y)
>>
>> ---------------
>>  > f
>>
>> Median Polish Results (Dataset: "Y")
>>
>> Overall: 6.949745
>>
>> Row Effects:
>>   [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360 -0.3210895
>>   [7]  0.5180266 -0.4351636 -1.2578855  0.4802634
>>
>> Column Effects:
>> [1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01  
>> -4.283675e-01
>>
>> Residuals:
>>               [,1]      [,2]      [,3]       [,4]       [,5]
>>   [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
>>   [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
>>   [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
>>   [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
>>   [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
>>   [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
>>   [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
>>   [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
>>   [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
>> [10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157
>>
>>  > dim(Y)
>> [1] 10  5
>>  > f$overall + f$row  # estimated chip effects
>>   [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772  
>> 6.514582
>>   [9] 5.691860 7.430009
>>  > ce                 # true chip effects
>>   [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525  
>> 6.101933
>>   [9] 5.117349 6.905161
>> ---------------
>>
>> quick sketch ... it would be (fairly) easy to split up your table of  
>> log-transformed normalized probe intensities into a matrix for each  
>> probeset (e.g. using split() ...), robustly fit the model, extract the  
>> chip effects and whammo, there is your table of summarized expression  
>> values ... this would only be a few lines of R code and probably not  
>> too inefficient (?) ... this is effectively what goes on under the  
>> hood of affy::rma() and the like (although it uses C code and in a  
>> very general way that uses CDF environments).
>>
>> I suspect you could use the 'oligo' package to do much the same thing,  
>> after using pdInfoBuilder() to correctly assign probes to probesets ...
>>
>> ... anyways, just some thoughts.
>>
>> Cheers,
>> Mark
>>
>>
>>
>>
>>
>>
>>
>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:
>>
>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
>>>
>>>> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
>>>>
>>>>> Hi Jose,
>>>>>
>>>>> Do you want to do RMA, or just normalize? The problem with trying to
>>>>> wedge things into an AffyBatch is that the affy package will then  
>>>>> try
>>>>> to find the cdfenv that contains the probe to probeset mappings,  
>>>>> so by
>>>>> trying to leverage the AffyBatch infrastructure you will have to  
>>>>> also
>>>>> come up with a fake cdf.
>>>>>
>>>>> If you don't have probes that make up a probeset, then I'm not  
>>>>> sure the
>>>>> affy package will be of use here.
>>>>>
>>>>> Can you give more details?
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>> Hi Jim,
>>>>
>>>> normalisation is not an issue, it's more to do with the  
>>>> summarisation of probesets and something like 'Expresso' seems like  
>>>> a good way to do what I need (and some other things I don't need).
>>>>
>>>> I am dealing with Nimblegen arrays. Both two colour (whole genome  
>>>> promoter arrays, with anything up to 20 probes per probeset), and  
>>>> one colour "a la Affymetrix" (expression arrays, with anything  
>>>> between 3 to 8 probes per probeset).
>>>>
>>>> I've been dealing with teh two colour stuff just like I used to  
>>>> deal with my spotted cDNA arrays, using Limma. To summarise the  
>>>> data... I've used several approaches. Mostly I am not interested in  
>>>> the whole 2.7kb that each "promoter region" comprises, so I've  
>>>> taken subsets blah blah... Anyway, I'm happy with the results there.
>>>> But for the expression data, I have one channel data, just like  
>>>> Affy data. Numblegen provides already normalised and summarised  
>>>> data along with the raw data, and they state they use the RMA  
>>>> procedure which I've come across with when readingabout Affy chips,  
>>>> although I've never analysed Affy data myself.
>>>>
>>>> I'm reasonably happy with the data given to me. It looks  
>>>> reasonable. So I want to be able to do that myself rather than  
>>>> depending on their data (thus allowing me to do things a bit  
>>>> differently if I want to), and since the RMA-processed data looks  
>>>> good, I am interested in finding a way to do RMA myself.
>>>>
>>>> You're right, the problem with my trying to make an AffyBatch from  
>>>> non Affy data is that I'm going to have to create a cdf-like  
>>>> file... and probably will encounter other obstacles... that's why I  
>>>> thought I'd ask here, as there's people who are very familiar with  
>>>> that structure...
>>>>
>>>> In my naivety, it seems it should be a simple enough task... and as  
>>>> I'm using 4 types of arrays mostly... I'd only have to do some work  
>>>> to make these work and then just enjoy the ride as new experiments  
>>>> roll in...
>>>> Am I naive? ;-)
>>> It is pretty simple to do what you want, but "simple" is of course  
>>> in the eye of the beholder - it depends on how familiar you are with  
>>> the affy structures from a development point of view.
>>>
>>> I am not familiar with Nimblegen, but that might be much easier, as  
>>> in working out of the box.
>>>
>>> Kasper
>>>
>>>
>>>> I hope I clarified enough what I'm after.
>>>>
>>>> Jose
>>>>
>>>>
>>>>
>>>> -- 
>>>> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
>>>> The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131  
>>>> 6513374
>>>> Institute for Cell & Molecular Biology        Fax:   +44 (0)131  
>>>> 6507360
>>>> Swann Building, Mayfield Road
>>>> University of Edinburgh
>>>> Edinburgh EH9 3JR
>>>> UK
>>>>
>>>> -- 
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> ------------------------------
>> Mark Robinson
>> Epigenetics Laboratory, Garvan
>> Bioinformatics Division, WEHI
>> e: m.robinson at garvan.org.au
>> e: mrobinson at wehi.edu.au
>> p: +61 (0)3 9345 2628
>> f: +61 (0)3 9347 0852
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826