[BioC] expresso: performing RMA on NON-Affy data?
Kasper Daniel Hansen
khansen at stat.Berkeley.EDU
Sat Apr 25 18:37:12 CEST 2009
Let me turn this around.
I have never seen a convincing simulation of microarray data. You can
find examples of simulated data, but I have never seen simulated data
behaving like real data.
Kasper
On Apr 24, 2009, at 11:35 , James W. MacDonald wrote:
> Hi Robert,
>
> Why would you want to simulate? There have to be hundreds of
> datasets that have celfiles on GEO alone - if you are interested in
> this subject why not just get some real data and see which method is
> better?
>
> Best,
>
> Jim
>
>
>
> Robert Castelo wrote:
>> hi Jim,
>> thanks for clearing this up for me. do you know then what would be
>> the
>> way of simulating the more realistic situation you describe? (such
>> that
>> we would see how median polish outperforms averaging in the
>> summarization step)
>> cheers,
>> robert.
>> On Fri, 2009-04-24 at 08:33 -0400, James W. MacDonald wrote:
>>> Hi Robert,
>>>
>>> Robert Castelo wrote:
>>>> dear Mark,
>>>>
>>>> i've followed with interest your example and i found something a
>>>> bit
>>>> counter intuitive to me, let me run your code again with a seed
>>>> so that
>>>> we can all reproduce the numbers:
>>>>
>>>> nsamples <- 10
>>>> nprobes <- 5
>>>> set.seed(123)
>>>> chipEffect <- rnorm(nsamples, mean=6)
>>>> names(chipEffect) <- paste("e",1:nsamples,sep="")
>>>> probeEffect <- rnorm(nprobes)
>>>> Y <- outer(chipEffect, probeEffect, "+") # probe effect
>>>> Y <- Y + rnorm(nsamples * nprobes, sd=0.1) # noise
>>>> dimnames(Y) <- list(paste("e",1:nsamples,sep=""),
>>>> paste("p",1:nprobes,sep=""))
>>>>
>>>> Y
>>>> p1 p2 p3 p4 p5
>>>> e1 6.842297 5.630669 5.909160 5.437896 5.035330
>>>> e2 7.043689 6.213415 6.225986 5.840217 5.059106
>>>> e3 8.586128 7.933859 7.953289 7.622725 7.061329
>>>> e4 7.364726 6.316509 6.440684 6.259188 5.527053
>>>> e5 7.306090 6.614483 6.492012 6.231634 5.595041
>>>> e6 8.832364 8.117525 8.046366 7.851080 7.197188
>>>> e7 7.663201 6.791223 6.840896 6.568744 5.854843
>>>> e8 5.856420 5.184265 5.009171 4.841334 4.145777
>>>> e9 6.464340 5.760774 5.930814 5.560690 4.655448
>>>> e10 6.715916 5.996310 6.075906 5.642444 4.891318
>>>>
>>>> f <- medpolish(Y)
>>>>
>>>> summaryvalues <- f$overall + f$row
>>>> summaryvalues
>>>> e1 e2 e3 e4 e5 e6
>>>> e7 e8 5.894875 6.211700 7.933859 6.433927 6.501916 8.104063
>>>> 6.826611 5.052652 e9 e10 5.760774 5.911843
>>>> chipEffect
>>>> e1 e2 e3 e4 e5 e6
>>>> e7 e8 5.439524 5.769823 7.558708 6.070508 6.129288 7.715065
>>>> 6.460916 4.734939 e9 e10 5.313147 5.554338
>>>> but if we actually calculate the mean squared error (MSE) of the
>>>> previous estimates with respect to the true chip effect and
>>>> compare it
>>>> with the MSE of the estimate obtained by simply averaging the
>>>> observations across probes:
>>>>
>>>> sum((summaryvalues-chipEffect)^2)
>>>> [1] 1.528436
>>>> sum((rowMeans(Y)-chipEffect)^2)
>>>> [1] 0.9438652
>>>>
>>>>
>>>> averaging seems to me to work better than median polish, am i
>>>> missing
>>>> something here?
>>> Yes. You are missing the fact that the data from Affy probes
>>> usually are not normally distributed. In fact, it is not uncommon
>>> for a given probeset to have widely divergent intensity levels for
>>> its component probes. Because of the fact that the mean is not
>>> robust to outliers, people long ago abandoned methods based on a
>>> normal distribution.
>>>
>>> So yeah, in your case with noisy normally distributed data, a mean
>>> decomposition is more powerful than a median decomposition (it is
>>> UMP for normally distributed data after all), but in practice it
>>> will be pretty ugly.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>> thanks!
>>>> robert.
>>>>
>>>>
>>>>
>>>> On Fri, 2009-04-24 at 11:42 +1000, Mark Robinson wrote:
>>>>> Jose,
>>>>>
>>>>> Can I add a (possibly naive) suggestion?
>>>>>
>>>>> You say normalization is not the issue, its the summarization of
>>>>> 3-8 probes per probeset for your 1-colour Nimblegen data. So
>>>>> maybe you just want to fit the log-additive RMA linear model to
>>>>> your data. This is akin to estimating a model with probe
>>>>> effects and chip effects for each probeset ... of which you are
>>>>> really interested in the chip effects.
>>>>>
>>>>> Say you were able to collect your data from a single probeset
>>>>> into a matrix Y (concocted example below):
>>>>>
>>>>> ce <- rnorm(10,mean=6) # 10 samples
>>>>> pe <- rnorm(5) # 5 probes in probeset
>>>>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) #
>>>>> add noise
>>>>>
>>>>> One way to do the fits in a quick and robust way is median polish:
>>>>>
>>>>> f <- medpolish(Y)
>>>>>
>>>>> ---------------
>>>>> > f
>>>>>
>>>>> Median Polish Results (Dataset: "Y")
>>>>>
>>>>> Overall: 6.949745
>>>>>
>>>>> Row Effects:
>>>>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360
>>>>> -0.3210895
>>>>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634
>>>>>
>>>>> Column Effects:
>>>>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01
>>>>> -4.283675e-01
>>>>>
>>>>> Residuals:
>>>>> [,1] [,2] [,3] [,4] [,5]
>>>>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587
>>>>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587
>>>>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658
>>>>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717
>>>>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018
>>>>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629
>>>>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533
>>>>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635
>>>>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479
>>>>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157
>>>>>
>>>>> > dim(Y)
>>>>> [1] 10 5
>>>>> > f$overall + f$row # estimated chip effects
>>>>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656
>>>>> 7.467772 6.514582
>>>>> [9] 5.691860 7.430009
>>>>> > ce # true chip effects
>>>>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149
>>>>> 6.963525 6.101933
>>>>> [9] 5.117349 6.905161
>>>>> ---------------
>>>>>
>>>>> quick sketch ... it would be (fairly) easy to split up your
>>>>> table of log-transformed normalized probe intensities into a
>>>>> matrix for each probeset (e.g. using split() ...), robustly fit
>>>>> the model, extract the chip effects and whammo, there is your
>>>>> table of summarized expression values ... this would only be a
>>>>> few lines of R code and probably not too inefficient (?) ...
>>>>> this is effectively what goes on under the hood of affy::rma()
>>>>> and the like (although it uses C code and in a very general way
>>>>> that uses CDF environments).
>>>>>
>>>>> I suspect you could use the 'oligo' package to do much the same
>>>>> thing, after using pdInfoBuilder() to correctly assign probes
>>>>> to probesets ...
>>>>>
>>>>> ... anyways, just some thoughts.
>>>>>
>>>>> Cheers,
>>>>> Mark
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:
>>>>>
>>>>>> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
>>>>>>
>>>>>>> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
>>>>>>>
>>>>>>>> Hi Jose,
>>>>>>>>
>>>>>>>> Do you want to do RMA, or just normalize? The problem with
>>>>>>>> trying to
>>>>>>>> wedge things into an AffyBatch is that the affy package will
>>>>>>>> then try
>>>>>>>> to find the cdfenv that contains the probe to probeset
>>>>>>>> mappings, so by
>>>>>>>> trying to leverage the AffyBatch infrastructure you will have
>>>>>>>> to also
>>>>>>>> come up with a fake cdf.
>>>>>>>>
>>>>>>>> If you don't have probes that make up a probeset, then I'm
>>>>>>>> not sure the
>>>>>>>> affy package will be of use here.
>>>>>>>>
>>>>>>>> Can you give more details?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jim
>>>>>>> Hi Jim,
>>>>>>>
>>>>>>> normalisation is not an issue, it's more to do with the
>>>>>>> summarisation of probesets and something like 'Expresso' seems
>>>>>>> like a good way to do what I need (and some other things I
>>>>>>> don't need).
>>>>>>>
>>>>>>> I am dealing with Nimblegen arrays. Both two colour (whole
>>>>>>> genome promoter arrays, with anything up to 20 probes per
>>>>>>> probeset), and one colour "a la Affymetrix" (expression
>>>>>>> arrays, with anything between 3 to 8 probes per probeset).
>>>>>>>
>>>>>>> I've been dealing with teh two colour stuff just like I used
>>>>>>> to deal with my spotted cDNA arrays, using Limma. To
>>>>>>> summarise the data... I've used several approaches. Mostly I
>>>>>>> am not interested in the whole 2.7kb that each "promoter
>>>>>>> region" comprises, so I've taken subsets blah blah... Anyway,
>>>>>>> I'm happy with the results there.
>>>>>>> But for the expression data, I have one channel data, just
>>>>>>> like Affy data. Numblegen provides already normalised and
>>>>>>> summarised data along with the raw data, and they state they
>>>>>>> use the RMA procedure which I've come across with when
>>>>>>> readingabout Affy chips, although I've never analysed Affy
>>>>>>> data myself.
>>>>>>>
>>>>>>> I'm reasonably happy with the data given to me. It looks
>>>>>>> reasonable. So I want to be able to do that myself rather
>>>>>>> than depending on their data (thus allowing me to do things a
>>>>>>> bit differently if I want to), and since the RMA-processed
>>>>>>> data looks good, I am interested in finding a way to do RMA
>>>>>>> myself.
>>>>>>>
>>>>>>> You're right, the problem with my trying to make an AffyBatch
>>>>>>> from non Affy data is that I'm going to have to create a cdf-
>>>>>>> like file... and probably will encounter other obstacles...
>>>>>>> that's why I thought I'd ask here, as there's people who are
>>>>>>> very familiar with that structure...
>>>>>>>
>>>>>>> In my naivety, it seems it should be a simple enough task...
>>>>>>> and as I'm using 4 types of arrays mostly... I'd only have to
>>>>>>> do some work to make these work and then just enjoy the ride
>>>>>>> as new experiments roll in...
>>>>>>> Am I naive? ;-)
>>>>>> It is pretty simple to do what you want, but "simple" is of
>>>>>> course in the eye of the beholder - it depends on how familiar
>>>>>> you are with the affy structures from a development point of
>>>>>> view.
>>>>>>
>>>>>> I am not familiar with Nimblegen, but that might be much
>>>>>> easier, as in working out of the box.
>>>>>>
>>>>>> Kasper
>>>>>>
>>>>>>
>>>>>>> I hope I clarified enough what I'm after.
>>>>>>>
>>>>>>> Jose
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
>>>>>>> The Wellcome Trust Centre for Cell Biology Phone: +44
>>>>>>> (0)131 6513374
>>>>>>> Institute for Cell & Molecular Biology Fax: +44
>>>>>>> (0)131 6507360
>>>>>>> Swann Building, Mayfield Road
>>>>>>> University of Edinburgh
>>>>>>> Edinburgh EH9 3JR
>>>>>>> UK
>>>>>>>
>>>>>>> --
>>>>>>> The University of Edinburgh is a charitable body, registered in
>>>>>>> Scotland, with registration number SC005336.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>> ------------------------------
>>>>> Mark Robinson
>>>>> Epigenetics Laboratory, Garvan
>>>>> Bioinformatics Division, WEHI
>>>>> e: m.robinson at garvan.org.au
>>>>> e: mrobinson at wehi.edu.au
>>>>> p: +61 (0)3 9345 2628
>>>>> f: +61 (0)3 9347 0852
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
More information about the Bioconductor
mailing list