[BioC] expresso: performing RMA on NON-Affy data?
Mark Robinson
mrobinson at wehi.EDU.AU
Thu Apr 30 14:55:31 CEST 2009
Jose.
Comments below.
On 30/04/2009, at 1:56 AM, J.delasHeras at ed.ac.uk wrote:
> Quoting Mark Robinson <mrobinson at wehi.EDU.AU>:
>
>> Jose,
>>
>> Can I add a (possibly naive) suggestion?
>>
>> You say normalization is not the issue, its the summarization of 3-8
>> probes per probeset for your 1-colour Nimblegen data. So maybe you
>> just want to fit the log-additive RMA linear model to your data.
>> This
>> is akin to estimating a model with probe effects and chip effects for
>> each probeset ... of which you are really interested in the chip
>> effects.
>>
>> Say you were able to collect your data from a single probeset into a
>> matrix Y (concocted example below):
>>
>> ce <- rnorm(10,mean=6) # 10 samples
>> pe <- rnorm(5) # 5 probes in probeset
>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 ) #
>> add noise
>>
>> One way to do the fits in a quick and robust way is median polish:
>>
>> f <- medpolish(Y)
>>
>> ---------------
>>> f
>>
>> Median Polish Results (Dataset: "Y")
>>
>> Overall: 6.949745
>>
>> Row Effects:
>> [1] 1.5885255 0.7841937 0.3210895 -0.9567836 -0.8557360 -0.3210895
>> [7] 0.5180266 -0.4351636 -1.2578855 0.4802634
>>
>> Column Effects:
>> [1] -2.243012e-05 6.828785e-01 -5.986373e-01 3.952830e-01
>> -4.283675e-01
>>
>> Residuals:
>> [,1] [,2] [,3] [,4] [,5]
>> [1,] -0.01316487 -0.051061 0.000000 0.0031193 0.0069587
>> [2,] 0.22046052 0.000000 -0.075148 0.0376293 -0.0069587
>> [3,] -0.03686560 0.000000 0.074483 -0.2351209 0.1423658
>> [4,] -0.06133574 0.077307 0.061807 -0.0031193 -0.0142717
>> [5,] 0.00002243 -0.106866 -0.221060 0.0788912 0.1171018
>> [6,] -0.00002243 0.000000 0.015280 0.1358204 -0.0110629
>> [7,] 0.02006485 0.142389 0.000000 -0.0664879 -0.0125533
>> [8,] -0.13400778 -0.050242 0.000000 0.1374301 0.0241635
>> [9,] 0.01329945 0.013142 0.000000 -0.0784278 -0.0775479
>> [10,] 0.11997319 0.000000 -0.130439 -0.1982599 0.1446157
>>
>>> dim(Y)
>> [1] 10 5
>>> f$overall + f$row # estimated chip effects
>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772
>> 6.514582
>> [9] 5.691860 7.430009
>>> ce # true chip effects
>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525
>> 6.101933
>> [9] 5.117349 6.905161
>> ---------------
>>
>> quick sketch ... it would be (fairly) easy to split up your table of
>> log-transformed normalized probe intensities into a matrix for each
>> probeset (e.g. using split() ...), robustly fit the model, extract
>> the
>> chip effects and whammo, there is your table of summarized expression
>> values ... this would only be a few lines of R code and probably not
>> too inefficient (?) ... this is effectively what goes on under the
>> hood
>> of affy::rma() and the like (although it uses C code and in a very
>> general way that uses CDF environments).
>>
>> I suspect you could use the 'oligo' package to do much the same
>> thing,
>> after using pdInfoBuilder() to correctly assign probes to
>> probesets ...
>>
>> ... anyways, just some thoughts.
>>
>> Cheers,
>> Mark
>
> Hi Mark,
>
> Thanks for your email! Your *naive* suggestion actually solves my
> most important issue.
> I wanted to apply the method and then learn what happened under the
> hood, but I should have probably gone the other way ;-)
>
> I tried the median polish and it reproduces almost exactly the
> summarised data I had, so I'm not sure this is the method they
> employed.
Good, glad it was useful.
> I just altered your example code in one place.
>
> The example had a matrix where columns corresponded to different
> chips (10 replicates) and rows corresponded to probes in the
> probeset (5). One matrix per probeset.
> In your code you suggested I took:
>
> f$overall + f$row (where f=medpolish(matrix))
>
> to obtain the estimated chip effects. This summarises the 10 chips
> into a single figure... per probe.
> What I wanted was to summarise each probeset into a single number,
> so I take instead:
>
> f$overall + f$col
Right, my concocted example was simply the transpose of what you have.
> This gives me the summarised probeset *per chip*, reducing my table
> of 72000ish rows (probes) x 12 chips to one of 24000ish rows
> (probesets) x 12 chips.
>
> That is great.
>
> The obvious question is... would it be reasonable to repeat the
> process to summarise replicate chips?
> Essentially it'll use the whole matrix, and the result could be
> obtained as a single column (24000ish values, one per probeset) like
> this:
>
> f$overall + f$row (as you suggested initially).
>
> probably a slow process, but does this make sense at least?
I'm not sure yet if this makes sense, but let me explain back what I
understand of this. You are proposing to take the whole (72000-by-12,
probes by samples) matrix and in the end, get a 24000-by-1 (probesets
by 1) vector. I don't think medpolish() will work for you here, not
as a single call at least.
The median polish model for a single probeset can be written as:
Y_{ij} = mu + row_i + col_j + error_{ij} [1]
with constraints on row_i and col_j so that the parameter are
identifiable ... a reparameterization of [1] gives ...
Y_{ij} = probe_i + chip_j + error_{ij} [2]
But, maybe what you want to share the chip parameter over all samples
(for replicate samples) and just have:
Y_{ij} = probe_i + chip + error_{ij} [3]
If you wanted to fit everything in one model (keep in mind that this
makes a constant variance assumption), you could run a single rlm fit
something like:
f <- rlm(Ybig ~ probeset + probe)
where Ybig is your 72000-by-12 matrix, probeset takes on 24000
different values and probe takes on 72000 different values (and will
have the constraint). I don't know if this will run in a reasonable
amount of time. Alternatively, you could just take a rowMeans() of
your 24000-by-12 matrix from before and it should be close. Or, yet
another alternative, fit model [3] separately for each probeset (rlm?)
and collect the "chip" estimates in a vector.
Lots of options.
Hope that helps.
Mark
> Jose
>
> --
> Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
> The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131
> 6513374
> Institute for Cell & Molecular Biology Fax: +44 (0)131
> 6507360
> Swann Building, Mayfield Road
> University of Edinburgh
> Edinburgh EH9 3JR
> UK
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
More information about the Bioconductor
mailing list