[BioC] expresso: performing RMA on NON-Affy data?

Sun Apr 26 15:31:55 CEST 2009

What is meant by "convincing simulation / behaving like real data"
here ? That it would pass a sort of Turing test for microarray data ?

Simulated data can be extremely handy for a "what if" scenario.

L.

On Sat, 2009-04-25 at 09:37 -0700, Kasper Daniel Hansen wrote:
> Let me turn this around.
> 
> I have never seen a convincing simulation of microarray data. You can  
> find examples of simulated data, but I have never seen simulated data  
> behaving like real data.
> 
> Kasper
> 
> On Apr 24, 2009, at 11:35 , James W. MacDonald wrote:
> 
> > Hi Robert,
> >
> > Why would you want to simulate? There have to be hundreds of  
> > datasets that have celfiles on GEO alone - if you are interested in  
> > this subject why not just get some real data and see which method is  
> > better?
> >
> > Best,
> >
> > Jim
> >
> >
> >
> > Robert Castelo wrote:
> >> hi Jim,
> >> thanks for clearing this up for me. do you know then what would be  
> >> the
> >> way of simulating the more realistic situation you describe? (such  
> >> that
> >> we would see how median polish outperforms averaging in the
> >> summarization step)
> >> cheers,
> >> robert.
> >> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote:
> >>> Hi Robert,
> >>>
> >>> Robert Castelo wrote:
> >>>> dear Mark,
> >>>>
> >>>> i've followed with interest your example and i found something a  
> >>>> bit
> >>>> counter intuitive to me, let me run your code again with a seed  
> >>>> so that
> >>>> we can all reproduce the numbers:
> >>>>
> >>>> nsamples <- 10
> >>>> nprobes <- 5
> >>>> set.seed(123)
> >>>> chipEffect <- rnorm(nsamples, mean=6)
> >>>> names(chipEffect) <- paste("e",1:nsamples,sep="")
> >>>> probeEffect <- rnorm(nprobes)
> >>>> Y <- outer(chipEffect, probeEffect, "+")   # probe effect
> >>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise
> >>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""),
> >>>> paste("p",1:nprobes,sep=""))
> >>>>
> >>>> Y
> >>>>          p1       p2       p3       p4       p5
> >>>> e1  6.842297 5.630669 5.909160 5.437896 5.035330
> >>>> e2  7.043689 6.213415 6.225986 5.840217 5.059106
> >>>> e3  8.586128 7.933859 7.953289 7.622725 7.061329
> >>>> e4  7.364726 6.316509 6.440684 6.259188 5.527053
> >>>> e5  7.306090 6.614483 6.492012 6.231634 5.595041
> >>>> e6  8.832364 8.117525 8.046366 7.851080 7.197188
> >>>> e7  7.663201 6.791223 6.840896 6.568744 5.854843
> >>>> e8  5.856420 5.184265 5.009171 4.841334 4.145777
> >>>> e9  6.464340 5.760774 5.930814 5.560690 4.655448
> >>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318
> >>>>
> >>>> f <- medpolish(Y)
> >>>>
> >>>> summaryvalues <- f$overall + f$row
> >>>> summaryvalues
> >>>>      e1       e2       e3       e4       e5       e6        
> >>>> e7       e8 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063  
> >>>> 6.826611 5.052652       e9      e10 5.760774 5.911843
> >>>> chipEffect
> >>>>      e1       e2       e3       e4       e5       e6        
> >>>> e7       e8 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065  
> >>>> 6.460916 4.734939       e9      e10 5.313147 5.554338
> >>>> but if we actually calculate the mean squared error (MSE) of the
> >>>> previous estimates with respect to the true chip effect and  
> >>>> compare it
> >>>> with the MSE of the estimate obtained by simply averaging the
> >>>> observations across probes:
> >>>>
> >>>> sum((summaryvalues-chipEffect)^2)
> >>>> [1] 1.528436
> >>>> sum((rowMeans(Y)-chipEffect)^2)
> >>>> [1] 0.9438652
> >>>>
> >>>>
> >>>> averaging seems to me to work better than median polish, am i  
> >>>> missing
> >>>> something here?
> >>> Yes. You are missing the fact that the data from Affy probes  
> >>> usually are not normally distributed. In fact, it is not uncommon  
> >>> for a given probeset to have widely divergent intensity levels for  
> >>> its component probes. Because of the fact that the mean is not  
> >>> robust to outliers, people long ago abandoned methods based on a  
> >>> normal distribution.
> >>>
> >>> So yeah, in your case with noisy normally distributed data, a mean  
> >>> decomposition is more powerful than a median decomposition (it is  
> >>> UMP for normally distributed data after all), but in practice it  
> >>> will be pretty ugly.
> >>>
> >>> Best,
> >>>
> >>> Jim
> >>>
> >>>
> >>>> thanks!
> >>>> robert.
> >>>>
> >>>>
> >>>>
> >>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote:
> >>>>> Jose,
> >>>>>
> >>>>> Can I add a (possibly naive) suggestion?
> >>>>>
> >>>>> You say normalization is not the issue, its the summarization of  
> >>>>> 3-8  probes per probeset for your 1-colour Nimblegen data.  So  
> >>>>> maybe you  just want to fit the log-additive RMA linear model to  
> >>>>> your data.  This  is akin to estimating a model with probe  
> >>>>> effects and chip effects for  each probeset ... of which you are  
> >>>>> really interested in the chip  effects.
> >>>>>
> >>>>> Say you were able to collect your data from a single probeset  
> >>>>> into a  matrix Y (concocted example below):
> >>>>>
> >>>>> ce <- rnorm(10,mean=6)   # 10 samples
> >>>>> pe <- rnorm(5)           # 5 probes in probeset
> >>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 )  #  
> >>>>> add  noise
> >>>>>
> >>>>> One way to do the fits in a quick and robust way is median polish:
> >>>>>
> >>>>> f <- medpolish(Y)
> >>>>>
> >>>>> ---------------
> >>>>> > f
> >>>>>
> >>>>> Median Polish Results (Dataset: "Y")
> >>>>>
> >>>>> Overall: 6.949745
> >>>>>
> >>>>> Row Effects:
> >>>>>  [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360  
> >>>>> -0.3210895
> >>>>>  [7]  0.5180266 -0.4351636 -1.2578855  0.4802634
> >>>>>
> >>>>> Column Effects:
> >>>>> [1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01   
> >>>>> -4.283675e-01
> >>>>>
> >>>>> Residuals:
> >>>>>              [,1]      [,2]      [,3]       [,4]       [,5]
> >>>>>  [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
> >>>>>  [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
> >>>>>  [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
> >>>>>  [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
> >>>>>  [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
> >>>>>  [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
> >>>>>  [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
> >>>>>  [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
> >>>>>  [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
> >>>>> [10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157
> >>>>>
> >>>>> > dim(Y)
> >>>>> [1] 10  5
> >>>>> > f$overall + f$row  # estimated chip effects
> >>>>>  [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656  
> >>>>> 7.467772  6.514582
> >>>>>  [9] 5.691860 7.430009
> >>>>> > ce                 # true chip effects
> >>>>>  [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149  
> >>>>> 6.963525  6.101933
> >>>>>  [9] 5.117349 6.905161
> >>>>> ---------------
> >>>>>
> >>>>> quick sketch ... it would be (fairly) easy to split up your  
> >>>>> table of  log-transformed normalized probe intensities into a  
> >>>>> matrix for each  probeset (e.g. using split() ...), robustly fit  
> >>>>> the model, extract the  chip effects and whammo, there is your  
> >>>>> table of summarized expression  values ... this would only be a  
> >>>>> few lines of R code and probably not  too inefficient (?) ...  
> >>>>> this is effectively what goes on under the  hood of affy::rma()  
> >>>>> and the like (although it uses C code and in a  very general way  
> >>>>> that uses CDF environments).
> >>>>>
> >>>>> I suspect you could use the 'oligo' package to do much the same  
> >>>>> thing,  after using pdInfoBuilder() to correctly assign probes  
> >>>>> to probesets ...
> >>>>>
> >>>>> ... anyways, just some thoughts.
> >>>>>
> >>>>> Cheers,
> >>>>> Mark
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:
> >>>>>
> >>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
> >>>>>>
> >>>>>>> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
> >>>>>>>
> >>>>>>>> Hi Jose,
> >>>>>>>>
> >>>>>>>> Do you want to do RMA, or just normalize? The problem with  
> >>>>>>>> trying to
> >>>>>>>> wedge things into an AffyBatch is that the affy package will  
> >>>>>>>> then  try
> >>>>>>>> to find the cdfenv that contains the probe to probeset  
> >>>>>>>> mappings,  so by
> >>>>>>>> trying to leverage the AffyBatch infrastructure you will have  
> >>>>>>>> to  also
> >>>>>>>> come up with a fake cdf.
> >>>>>>>>
> >>>>>>>> If you don't have probes that make up a probeset, then I'm  
> >>>>>>>> not  sure the
> >>>>>>>> affy package will be of use here.
> >>>>>>>>
> >>>>>>>> Can you give more details?
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>>
> >>>>>>>> Jim
> >>>>>>> Hi Jim,
> >>>>>>>
> >>>>>>> normalisation is not an issue, it's more to do with the   
> >>>>>>> summarisation of probesets and something like 'Expresso' seems  
> >>>>>>> like  a good way to do what I need (and some other things I  
> >>>>>>> don't need).
> >>>>>>>
> >>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole  
> >>>>>>> genome  promoter arrays, with anything up to 20 probes per  
> >>>>>>> probeset), and  one colour "a la Affymetrix" (expression  
> >>>>>>> arrays, with anything  between 3 to 8 probes per probeset).
> >>>>>>>
> >>>>>>> I've been dealing with teh two colour stuff just like I used  
> >>>>>>> to  deal with my spotted cDNA arrays, using Limma. To  
> >>>>>>> summarise the  data... I've used several approaches. Mostly I  
> >>>>>>> am not interested in  the whole 2.7kb that each "promoter  
> >>>>>>> region" comprises, so I've  taken subsets blah blah... Anyway,  
> >>>>>>> I'm happy with the results there.
> >>>>>>> But for the expression data, I have one channel data, just  
> >>>>>>> like  Affy data. Numblegen provides already normalised and  
> >>>>>>> summarised  data along with the raw data, and they state they  
> >>>>>>> use the RMA  procedure which I've come across with when  
> >>>>>>> readingabout Affy chips,  although I've never analysed Affy  
> >>>>>>> data myself.
> >>>>>>>
> >>>>>>> I'm reasonably happy with the data given to me. It looks   
> >>>>>>> reasonable. So I want to be able to do that myself rather  
> >>>>>>> than  depending on their data (thus allowing me to do things a  
> >>>>>>> bit  differently if I want to), and since the RMA-processed  
> >>>>>>> data looks  good, I am interested in finding a way to do RMA  
> >>>>>>> myself.
> >>>>>>>
> >>>>>>> You're right, the problem with my trying to make an AffyBatch  
> >>>>>>> from  non Affy data is that I'm going to have to create a cdf- 
> >>>>>>> like  file... and probably will encounter other obstacles...  
> >>>>>>> that's why I  thought I'd ask here, as there's people who are  
> >>>>>>> very familiar with  that structure...
> >>>>>>>
> >>>>>>> In my naivety, it seems it should be a simple enough task...  
> >>>>>>> and as  I'm using 4 types of arrays mostly... I'd only have to  
> >>>>>>> do some work  to make these work and then just enjoy the ride  
> >>>>>>> as new experiments  roll in...
> >>>>>>> Am I naive? ;-)
> >>>>>> It is pretty simple to do what you want, but "simple" is of  
> >>>>>> course  in the eye of the beholder - it depends on how familiar  
> >>>>>> you are with  the affy structures from a development point of  
> >>>>>> view.
> >>>>>>
> >>>>>> I am not familiar with Nimblegen, but that might be much  
> >>>>>> easier, as  in working out of the box.
> >>>>>>
> >>>>>> Kasper
> >>>>>>
> >>>>>>
> >>>>>>> I hope I clarified enough what I'm after.
> >>>>>>>
> >>>>>>> Jose
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> -- 
> >>>>>>> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
> >>>>>>> The Wellcome Trust Centre for Cell Biology    Phone: +44  
> >>>>>>> (0)131  6513374
> >>>>>>> Institute for Cell & Molecular Biology        Fax:   +44  
> >>>>>>> (0)131  6507360
> >>>>>>> Swann Building, Mayfield Road
> >>>>>>> University of Edinburgh
> >>>>>>> Edinburgh EH9 3JR
> >>>>>>> UK
> >>>>>>>
> >>>>>>> -- 
> >>>>>>> The University of Edinburgh is a charitable body, registered in
> >>>>>>> Scotland, with registration number SC005336.
> >>>>>>>
> >>>>>>> _______________________________________________
> >>>>>>> Bioconductor mailing list
> >>>>>>> Bioconductor at stat.math.ethz.ch
> >>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>> _______________________________________________
> >>>>>> Bioconductor mailing list
> >>>>>> Bioconductor at stat.math.ethz.ch
> >>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>> ------------------------------
> >>>>> Mark Robinson
> >>>>> Epigenetics Laboratory, Garvan
> >>>>> Bioinformatics Division, WEHI
> >>>>> e: m.robinson at garvan.org.au
> >>>>> e: mrobinson at wehi.edu.au
> >>>>> p: +61 (0)3 9345 2628
> >>>>> f: +61 (0)3 9347 0852
> >>>>>
> >>>>> _______________________________________________
> >>>>> Bioconductor mailing list
> >>>>> Bioconductor at stat.math.ethz.ch
> >>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>>>>
> >
> > -- 
> > James W. MacDonald, M.S.
> > Biostatistician
> > Douglas Lab
> > University of Michigan
> > Department of Human Genetics
> > 5912 Buhl
> > 1241 E. Catherine St.
> > Ann Arbor MI 48109-5618
> > 734-615-7826
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor