[BioC] expresso: performing RMA on NON-Affy data?

Mon Apr 27 11:11:55 CEST 2009

Robert Castelo wrote:
> hi Jim,
> 
> the reason is that i'm teaching a course on microarray data analysis to
> students who are not familiar with statistics beyond the basic
> descriptive ones. in front of such audience it has been helpful for me
> to simulate some data and apply to it the corresponding analysis
> technique when illustrating how the technique works (after that, then we
> use it on real data). by simulating data, people sees explicitly the
> assumptions made behind the mechanism generating these data so that a
> fraction of them (which makes me already happy) gets to understand why a
> particular method works better than other one.

The work presented in "Preferred analysis methods for Affymetrix 
GeneChips revealed by a wholly defined control dataset - Genome Biology 
2005 - Choe et al." could be an interesting starting point (and dataset).

Otherwise, did you check the package biocDatasets ? It is still very 
much work-in-progress, but it should already provide a framework for 
such simulations.

> in the particular case below i'd like to make the point on why the
> median polish summarization method works better than the taking the
> arithmetic mean, illustrating somehow what you wisely said about the
> mean being not robust to outliers but being uniformly more powerful for
> Gaussian data, etc etc.
> 
> i know i can download lots of real data, but i don't know how could i
> demonstrate that a summarization method is better than other one with
> real data.

Spike-ins datasets and graphical representation have mostly been used to 
demonstrate the efficiency of processing methods. An early reference is
"A benchmark for Affymetrix GeneChip expression measures - 
Bioinformatics 2003 - Leslie Cope et al.", but there are several more 
recent ones. Correlation between "expected" values and the actual values 
after processing is often used.

> using some QC technique (MA plots..) ?? i'll appreciate any
> idea about this too !! :)

L.

> thanks!
> robert.
> 
> On Fri, 2009-04-24 at 14:35 -0400, James W. MacDonald wrote:
>> Hi Robert,
>>
>> Why would you want to simulate? There have to be hundreds of datasets 
>> that have celfiles on GEO alone - if you are interested in this subject 
>> why not just get some real data and see which method is better?
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> Robert Castelo wrote:
>>> hi Jim,
>>>
>>> thanks for clearing this up for me. do you know then what would be the
>>> way of simulating the more realistic situation you describe? (such that
>>> we would see how median polish outperforms averaging in the
>>> summarization step)
>>>
>>> cheers,
>>> robert.
>>>
>>>
>>> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote:
>>>> Hi Robert,
>>>>
>>>> Robert Castelo wrote:
>>>>> dear Mark,
>>>>>
>>>>> i've followed with interest your example and i found something a bit
>>>>> counter intuitive to me, let me run your code again with a seed so that
>>>>> we can all reproduce the numbers:
>>>>>
>>>>> nsamples <- 10
>>>>> nprobes <- 5
>>>>> set.seed(123)
>>>>> chipEffect <- rnorm(nsamples, mean=6)
>>>>> names(chipEffect) <- paste("e",1:nsamples,sep="")
>>>>> probeEffect <- rnorm(nprobes)
>>>>> Y <- outer(chipEffect, probeEffect, "+")   # probe effect
>>>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise
>>>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""),
>>>>> paste("p",1:nprobes,sep=""))
>>>>>
>>>>> Y
>>>>>           p1       p2       p3       p4       p5
>>>>> e1  6.842297 5.630669 5.909160 5.437896 5.035330
>>>>> e2  7.043689 6.213415 6.225986 5.840217 5.059106
>>>>> e3  8.586128 7.933859 7.953289 7.622725 7.061329
>>>>> e4  7.364726 6.316509 6.440684 6.259188 5.527053
>>>>> e5  7.306090 6.614483 6.492012 6.231634 5.595041
>>>>> e6  8.832364 8.117525 8.046366 7.851080 7.197188
>>>>> e7  7.663201 6.791223 6.840896 6.568744 5.854843
>>>>> e8  5.856420 5.184265 5.009171 4.841334 4.145777
>>>>> e9  6.464340 5.760774 5.930814 5.560690 4.655448
>>>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318
>>>>>
>>>>> f <- medpolish(Y)
>>>>>
>>>>> summaryvalues <- f$overall + f$row
>>>>> summaryvalues
>>>>>       e1       e2       e3       e4       e5       e6       e7       e8 
>>>>> 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652 
>>>>>       e9      e10 
>>>>> 5.760774 5.911843 
>>>>>
>>>>> chipEffect
>>>>>       e1       e2       e3       e4       e5       e6       e7       e8 
>>>>> 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939 
>>>>>       e9      e10 
>>>>> 5.313147 5.554338 
>>>>>
>>>>> but if we actually calculate the mean squared error (MSE) of the
>>>>> previous estimates with respect to the true chip effect and compare it
>>>>> with the MSE of the estimate obtained by simply averaging the
>>>>> observations across probes:
>>>>>
>>>>> sum((summaryvalues-chipEffect)^2)
>>>>> [1] 1.528436
>>>>> sum((rowMeans(Y)-chipEffect)^2)
>>>>> [1] 0.9438652
>>>>>
>>>>>
>>>>> averaging seems to me to work better than median polish, am i missing
>>>>> something here?
>>>> Yes. You are missing the fact that the data from Affy probes usually are 
>>>> not normally distributed. In fact, it is not uncommon for a given 
>>>> probeset to have widely divergent intensity levels for its component 
>>>> probes. Because of the fact that the mean is not robust to outliers, 
>>>> people long ago abandoned methods based on a normal distribution.
>>>>
>>>> So yeah, in your case with noisy normally distributed data, a mean 
>>>> decomposition is more powerful than a median decomposition (it is UMP 
>>>> for normally distributed data after all), but in practice it will be 
>>>> pretty ugly.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>> thanks!
>>>>> robert.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote:
>>>>>> Jose,
>>>>>>
>>>>>> Can I add a (possibly naive) suggestion?
>>>>>>
>>>>>> You say normalization is not the issue, its the summarization of 3-8  
>>>>>> probes per probeset for your 1-colour Nimblegen data.  So maybe you  
>>>>>> just want to fit the log-additive RMA linear model to your data.  This  
>>>>>> is akin to estimating a model with probe effects and chip effects for  
>>>>>> each probeset ... of which you are really interested in the chip  
>>>>>> effects.
>>>>>>
>>>>>> Say you were able to collect your data from a single probeset into a  
>>>>>> matrix Y (concocted example below):
>>>>>>
>>>>>> ce <- rnorm(10,mean=6)   # 10 samples
>>>>>> pe <- rnorm(5)           # 5 probes in probeset
>>>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 )  # add  
>>>>>> noise
>>>>>>
>>>>>> One way to do the fits in a quick and robust way is median polish:
>>>>>>
>>>>>> f <- medpolish(Y)
>>>>>>
>>>>>> ---------------
>>>>>>  > f
>>>>>>
>>>>>> Median Polish Results (Dataset: "Y")
>>>>>>
>>>>>> Overall: 6.949745
>>>>>>
>>>>>> Row Effects:
>>>>>>   [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360 -0.3210895
>>>>>>   [7]  0.5180266 -0.4351636 -1.2578855  0.4802634
>>>>>>
>>>>>> Column Effects:
>>>>>> [1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01  
>>>>>> -4.283675e-01
>>>>>>
>>>>>> Residuals:
>>>>>>               [,1]      [,2]      [,3]       [,4]       [,5]
>>>>>>   [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
>>>>>>   [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
>>>>>>   [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
>>>>>>   [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
>>>>>>   [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
>>>>>>   [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
>>>>>>   [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
>>>>>>   [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
>>>>>>   [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
>>>>>> [10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157
>>>>>>
>>>>>>  > dim(Y)
>>>>>> [1] 10  5
>>>>>>  > f$overall + f$row  # estimated chip effects
>>>>>>   [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772  
>>>>>> 6.514582
>>>>>>   [9] 5.691860 7.430009
>>>>>>  > ce                 # true chip effects
>>>>>>   [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525  
>>>>>> 6.101933
>>>>>>   [9] 5.117349 6.905161
>>>>>> ---------------
>>>>>>
>>>>>> quick sketch ... it would be (fairly) easy to split up your table of  
>>>>>> log-transformed normalized probe intensities into a matrix for each  
>>>>>> probeset (e.g. using split() ...), robustly fit the model, extract the  
>>>>>> chip effects and whammo, there is your table of summarized expression  
>>>>>> values ... this would only be a few lines of R code and probably not  
>>>>>> too inefficient (?) ... this is effectively what goes on under the  
>>>>>> hood of affy::rma() and the like (although it uses C code and in a  
>>>>>> very general way that uses CDF environments).
>>>>>>
>>>>>> I suspect you could use the 'oligo' package to do much the same thing,  
>>>>>> after using pdInfoBuilder() to correctly assign probes to probesets ...
>>>>>>
>>>>>> ... anyways, just some thoughts.
>>>>>>
>>>>>> Cheers,
>>>>>> Mark
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:
>>>>>>
>>>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
>>>>>>>
>>>>>>>> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
>>>>>>>>
>>>>>>>>> Hi Jose,
>>>>>>>>>
>>>>>>>>> Do you want to do RMA, or just normalize? The problem with trying to
>>>>>>>>> wedge things into an AffyBatch is that the affy package will then  
>>>>>>>>> try
>>>>>>>>> to find the cdfenv that contains the probe to probeset mappings,  
>>>>>>>>> so by
>>>>>>>>> trying to leverage the AffyBatch infrastructure you will have to  
>>>>>>>>> also
>>>>>>>>> come up with a fake cdf.
>>>>>>>>>
>>>>>>>>> If you don't have probes that make up a probeset, then I'm not  
>>>>>>>>> sure the
>>>>>>>>> affy package will be of use here.
>>>>>>>>>
>>>>>>>>> Can you give more details?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jim
>>>>>>>> Hi Jim,
>>>>>>>>
>>>>>>>> normalisation is not an issue, it's more to do with the  
>>>>>>>> summarisation of probesets and something like 'Expresso' seems like  
>>>>>>>> a good way to do what I need (and some other things I don't need).
>>>>>>>>
>>>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole genome  
>>>>>>>> promoter arrays, with anything up to 20 probes per probeset), and  
>>>>>>>> one colour "a la Affymetrix" (expression arrays, with anything  
>>>>>>>> between 3 to 8 probes per probeset).
>>>>>>>>
>>>>>>>> I've been dealing with teh two colour stuff just like I used to  
>>>>>>>> deal with my spotted cDNA arrays, using Limma. To summarise the  
>>>>>>>> data... I've used several approaches. Mostly I am not interested in  
>>>>>>>> the whole 2.7kb that each "promoter region" comprises, so I've  
>>>>>>>> taken subsets blah blah... Anyway, I'm happy with the results there.
>>>>>>>> But for the expression data, I have one channel data, just like  
>>>>>>>> Affy data. Numblegen provides already normalised and summarised  
>>>>>>>> data along with the raw data, and they state they use the RMA  
>>>>>>>> procedure which I've come across with when readingabout Affy chips,  
>>>>>>>> although I've never analysed Affy data myself.
>>>>>>>>
>>>>>>>> I'm reasonably happy with the data given to me. It looks  
>>>>>>>> reasonable. So I want to be able to do that myself rather than  
>>>>>>>> depending on their data (thus allowing me to do things a bit  
>>>>>>>> differently if I want to), and since the RMA-processed data looks  
>>>>>>>> good, I am interested in finding a way to do RMA myself.
>>>>>>>>
>>>>>>>> You're right, the problem with my trying to make an AffyBatch from  
>>>>>>>> non Affy data is that I'm going to have to create a cdf-like  
>>>>>>>> file... and probably will encounter other obstacles... that's why I  
>>>>>>>> thought I'd ask here, as there's people who are very familiar with  
>>>>>>>> that structure...
>>>>>>>>
>>>>>>>> In my naivety, it seems it should be a simple enough task... and as  
>>>>>>>> I'm using 4 types of arrays mostly... I'd only have to do some work  
>>>>>>>> to make these work and then just enjoy the ride as new experiments  
>>>>>>>> roll in...
>>>>>>>> Am I naive? ;-)
>>>>>>> It is pretty simple to do what you want, but "simple" is of course  
>>>>>>> in the eye of the beholder - it depends on how familiar you are with  
>>>>>>> the affy structures from a development point of view.
>>>>>>>
>>>>>>> I am not familiar with Nimblegen, but that might be much easier, as  
>>>>>>> in working out of the box.
>>>>>>>
>>>>>>> Kasper
>>>>>>>
>>>>>>>
>>>>>>>> I hope I clarified enough what I'm after.
>>>>>>>>
>>>>>>>> Jose
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
>>>>>>>> The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131  
>>>>>>>> 6513374
>>>>>>>> Institute for Cell & Molecular Biology        Fax:   +44 (0)131  
>>>>>>>> 6507360
>>>>>>>> Swann Building, Mayfield Road
>>>>>>>> University of Edinburgh
>>>>>>>> Edinburgh EH9 3JR
>>>>>>>> UK
>>>>>>>>
>>>>>>>> -- 
>>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>>> Scotland, with registration number SC005336.
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> ------------------------------
>>>>>> Mark Robinson
>>>>>> Epigenetics Laboratory, Garvan
>>>>>> Bioinformatics Division, WEHI
>>>>>> e: m.robinson at garvan.org.au
>>>>>> e: mrobinson at wehi.edu.au
>>>>>> p: +61 (0)3 9345 2628
>>>>>> f: +61 (0)3 9347 0852
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor