[BioC] expresso: performing RMA on NON-Affy data?
James W. MacDonald
jmacdon at med.umich.edu
Fri Apr 24 20:35:12 CEST 2009
Hi Robert,
Why would you want to simulate? There have to be hundreds of datasets
that have celfiles on GEO alone - if you are interested in this subject
why not just get some real data and see which method is better?
Best,
Jim
Robert Castelo wrote:
> hi Jim,
>
> thanks for clearing this up for me. do you know then what would be the
> way of simulating the more realistic situation you describe? (such that
> we would see how median polish outperforms averaging in the
> summarization step)
>
> cheers,
> robert.
>
>
> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote:
>> Hi Robert,
>>
>> Robert Castelo wrote:
>>> dear Mark,
>>>
>>> i've followed with interest your example and i found something a bit
>>> counter intuitive to me, let me run your code again with a seed so that
>>> we can all reproduce the numbers:
>>>
>>> nsamples <- 10
>>> nprobes <- 5
>>> set.seed(123)
>>> chipEffect <- rnorm(nsamples, mean=6)
>>> names(chipEffect) <- paste("e",1:nsamples,sep="")
>>> probeEffect <- rnorm(nprobes)
>>> Y <- outer(chipEffect, probeEffect, "+") # probe effect
>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise
>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""),
>>> paste("p",1:nprobes,sep=""))
>>>
>>> Y
>>> p1 p2 p3 p4 p5
>>> e1 6.842297 5.630669 5.909160 5.437896 5.035330
>>> e2 7.043689 6.213415 6.225986 5.840217 5.059106
>>> e3 8.586128 7.933859 7.953289 7.622725 7.061329
>>> e4 7.364726 6.316509 6.440684 6.259188 5.527053
>>> e5 7.306090 6.614483 6.492012 6.231634 5.595041
>>> e6 8.832364 8.117525 8.046366 7.851080 7.197188
>>> e7 7.663201 6.791223 6.840896 6.568744 5.854843
>>> e8 5.856420 5.184265 5.009171 4.841334 4.145777
>>> e9 6.464340 5.760774 5.930814 5.560690 4.655448
>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318
>>>
>>> f <- medpolish(Y)
>>>
>>> summaryvalues <- f$overall + f$row
>>> summaryvalues
>>> e1 e2 e3 e4 e5 e6 e7 e8
>>> 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063 6.826611 5.052652
>>> e9 e10
>>> 5.760774 5.911843
>>>
>>> chipEffect
>>> e1 e2 e3 e4 e5 e6 e7 e8
>>> 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065 6.460916 4.734939
>>> e9 e10
>>> 5.313147 5.554338
>>>
>>> but if we actually calculate the mean squared error (MSE) of the
>>> previous estimates with respect to the true chip effect and compare it
>>> with the MSE of the estimate obtained by simply averaging the
>>> observations across probes:
>>>
>>> sum((summaryvalues-chipEffect)^2)
>>> [1] 1.528436
>>> sum((rowMeans(Y)-chipEffect)^2)
>>> [1] 0.9438652
>>>
>>>
>>> averaging seems to me to work better than median polish, am i missing
>>> something here?
>> Yes. You are missing the fact that the data from Affy probes usually are
>> not normally distributed. In fact, it is not uncommon for a given
>> probeset to have widely divergent intensity levels for its component
>> probes. Because of the fact that the mean is not robust to outliers,
>> people long ago abandoned methods based on a normal distribution.
>>
>> So yeah, in your case with noisy normally distributed data, a mean
>> decomposition is more powerful than a median decomposition (it is UMP
>> for normally distributed data after all), but in practice it will be
>> pretty ugly.
>>
>> Best,
>>
>> Jim
>>
>>
>>> thanks!
>>> robert.
>>>
>>>
>>>
>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote:
>>>> Jose,
>>>>
>>>> Can I add a (possibly naive) suggestion?
>>>>
>>>> You say normalization is not the issue, its the summarization of 3-8
>>>> probes per probeset for your 1-colour Nimblegen data. So maybe you
>>>> just want to fit the log-additive RMA linear model to your data. This
>>>> is akin to estimating a model with probe effects and chip effects for
>>>> each probeset ... of which you are really interested in the chip
>>>> effects.
>>>>
>>>> Say you were able to collect your data from a single probeset into a
>>>> matrix Y (concocted example below):
>>>>
>>>> ce <- rnorm(10,mean=6) # 10 samples
>>>> pe <- rnorm(5) # 5 probes in probeset
>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) # add
>>>> noise
>>>>
>>>> One way to do the fits in a quick and robust way is median polish:
>>>>
>>>> f <- medpolish(Y)
>>>>
>>>> ---------------
>>>> > f
>>>>
>>>> Median Polish Results (Dataset: "Y")
>>>>
>>>> Overall: 6.949745
>>>>
>>>> Row Effects:
>>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895
>>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634
>>>>
>>>> Column Effects:
>>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01
>>>> -4.283675e-01
>>>>
>>>> Residuals:
>>>> [,1] [,2] [,3] [,4] [,5]
>>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587
>>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587
>>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658
>>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717
>>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018
>>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629
>>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533
>>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635
>>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479
>>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157
>>>>
>>>> > dim(Y)
>>>> [1] 10 5
>>>> > f$overall + f$row # estimated chip effects
>>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772
>>>> 6.514582
>>>> [9] 5.691860 7.430009
>>>> > ce # true chip effects
>>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525
>>>> 6.101933
>>>> [9] 5.117349 6.905161
>>>> ---------------
>>>>
>>>> quick sketch ... it would be (fairly) easy to split up your table of
>>>> log-transformed normalized probe intensities into a matrix for each
>>>> probeset (e.g. using split() ...), robustly fit the model, extract the
>>>> chip effects and whammo, there is your table of summarized expression
>>>> values ... this would only be a few lines of R code and probably not
>>>> too inefficient (?) ... this is effectively what goes on under the
>>>> hood of affy::rma() and the like (although it uses C code and in a
>>>> very general way that uses CDF environments).
>>>>
>>>> I suspect you could use the 'oligo' package to do much the same thing,
>>>> after using pdInfoBuilder() to correctly assign probes to probesets ...
>>>>
>>>> ... anyways, just some thoughts.
>>>>
>>>> Cheers,
>>>> Mark
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:
>>>>
>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
>>>>>
>>>>>> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
>>>>>>
>>>>>>> Hi Jose,
>>>>>>>
>>>>>>> Do you want to do RMA, or just normalize? The problem with trying to
>>>>>>> wedge things into an AffyBatch is that the affy package will then
>>>>>>> try
>>>>>>> to find the cdfenv that contains the probe to probeset mappings,
>>>>>>> so by
>>>>>>> trying to leverage the AffyBatch infrastructure you will have to
>>>>>>> also
>>>>>>> come up with a fake cdf.
>>>>>>>
>>>>>>> If you don't have probes that make up a probeset, then I'm not
>>>>>>> sure the
>>>>>>> affy package will be of use here.
>>>>>>>
>>>>>>> Can you give more details?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>> Hi Jim,
>>>>>>
>>>>>> normalisation is not an issue, it's more to do with the
>>>>>> summarisation of probesets and something like 'Expresso' seems like
>>>>>> a good way to do what I need (and some other things I don't need).
>>>>>>
>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole genome
>>>>>> promoter arrays, with anything up to 20 probes per probeset), and
>>>>>> one colour "a la Affymetrix" (expression arrays, with anything
>>>>>> between 3 to 8 probes per probeset).
>>>>>>
>>>>>> I've been dealing with teh two colour stuff just like I used to
>>>>>> deal with my spotted cDNA arrays, using Limma. To summarise the
>>>>>> data... I've used several approaches. Mostly I am not interested in
>>>>>> the whole 2.7kb that each "promoter region" comprises, so I've
>>>>>> taken subsets blah blah... Anyway, I'm happy with the results there.
>>>>>> But for the expression data, I have one channel data, just like
>>>>>> Affy data. Numblegen provides already normalised and summarised
>>>>>> data along with the raw data, and they state they use the RMA
>>>>>> procedure which I've come across with when readingabout Affy chips,
>>>>>> although I've never analysed Affy data myself.
>>>>>>
>>>>>> I'm reasonably happy with the data given to me. It looks
>>>>>> reasonable. So I want to be able to do that myself rather than
>>>>>> depending on their data (thus allowing me to do things a bit
>>>>>> differently if I want to), and since the RMA-processed data looks
>>>>>> good, I am interested in finding a way to do RMA myself.
>>>>>>
>>>>>> You're right, the problem with my trying to make an AffyBatch from
>>>>>> non Affy data is that I'm going to have to create a cdf-like
>>>>>> file... and probably will encounter other obstacles... that's why I
>>>>>> thought I'd ask here, as there's people who are very familiar with
>>>>>> that structure...
>>>>>>
>>>>>> In my naivety, it seems it should be a simple enough task... and as
>>>>>> I'm using 4 types of arrays mostly... I'd only have to do some work
>>>>>> to make these work and then just enjoy the ride as new experiments
>>>>>> roll in...
>>>>>> Am I naive? ;-)
>>>>> It is pretty simple to do what you want, but "simple" is of course
>>>>> in the eye of the beholder - it depends on how familiar you are with
>>>>> the affy structures from a development point of view.
>>>>>
>>>>> I am not familiar with Nimblegen, but that might be much easier, as
>>>>> in working out of the box.
>>>>>
>>>>> Kasper
>>>>>
>>>>>
>>>>>> I hope I clarified enough what I'm after.
>>>>>>
>>>>>> Jose
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
>>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131
>>>>>> 6513374
>>>>>> Institute for Cell & Molecular Biology Fax: +44 (0)131
>>>>>> 6507360
>>>>>> Swann Building, Mayfield Road
>>>>>> University of Edinburgh
>>>>>> Edinburgh EH9 3JR
>>>>>> UK
>>>>>>
>>>>>> --
>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>> Scotland, with registration number SC005336.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> ------------------------------
>>>> Mark Robinson
>>>> Epigenetics Laboratory, Garvan
>>>> Bioinformatics Division, WEHI
>>>> e: m.robinson at garvan.org.au
>>>> e: mrobinson at wehi.edu.au
>>>> p: +61 (0)3 9345 2628
>>>> f: +61 (0)3 9347 0852
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
More information about the Bioconductor
mailing list