[BioC] expresso: performing RMA on NON-Affy data?

Thu Apr 30 14:55:31 CEST 2009

Jose.

Comments below.

On 30/04/2009, at 1:56 AM, J.delasHeras at ed.ac.uk wrote:

> Quoting Mark Robinson <mrobinson at wehi.EDU.AU>:
>
>> Jose,
>>
>> Can I add a (possibly naive) suggestion?
>>
>> You say normalization is not the issue, its the summarization of 3-8
>> probes per probeset for your 1-colour Nimblegen data.  So maybe you
>> just want to fit the log-additive RMA linear model to your data.   
>> This
>> is akin to estimating a model with probe effects and chip effects for
>> each probeset ... of which you are really interested in the chip
>> effects.
>>
>> Say you were able to collect your data from a single probeset into a
>> matrix Y (concocted example below):
>>
>> ce <- rnorm(10,mean=6)   # 10 samples
>> pe <- rnorm(5)           # 5 probes in probeset
>> Y <- outer(ce,pe,"+") + rnorm( length(ce)*length(pe), sd=.1 )  #  
>> add noise
>>
>> One way to do the fits in a quick and robust way is median polish:
>>
>> f <- medpolish(Y)
>>
>> ---------------
>>> f
>>
>> Median Polish Results (Dataset: "Y")
>>
>> Overall: 6.949745
>>
>> Row Effects:
>> [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360 -0.3210895
>> [7]  0.5180266 -0.4351636 -1.2578855  0.4802634
>>
>> Column Effects:
>> [1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01  
>> -4.283675e-01
>>
>> Residuals:
>>             [,1]      [,2]      [,3]       [,4]       [,5]
>> [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
>> [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
>> [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
>> [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
>> [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
>> [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
>> [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
>> [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
>> [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
>> [10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157
>>
>>> dim(Y)
>> [1] 10  5
>>> f$overall + f$row  # estimated chip effects
>> [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772  
>> 6.514582
>> [9] 5.691860 7.430009
>>> ce                 # true chip effects
>> [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525  
>> 6.101933
>> [9] 5.117349 6.905161
>> ---------------
>>
>> quick sketch ... it would be (fairly) easy to split up your table of
>> log-transformed normalized probe intensities into a matrix for each
>> probeset (e.g. using split() ...), robustly fit the model, extract  
>> the
>> chip effects and whammo, there is your table of summarized expression
>> values ... this would only be a few lines of R code and probably not
>> too inefficient (?) ... this is effectively what goes on under the  
>> hood
>> of affy::rma() and the like (although it uses C code and in a very
>> general way that uses CDF environments).
>>
>> I suspect you could use the 'oligo' package to do much the same  
>> thing,
>> after using pdInfoBuilder() to correctly assign probes to  
>> probesets ...
>>
>> ... anyways, just some thoughts.
>>
>> Cheers,
>> Mark
>
> Hi Mark,
>
> Thanks for your email! Your *naive* suggestion actually solves my  
> most important issue.
> I wanted to apply the method and then learn what happened under the  
> hood, but I should have probably gone the other way ;-)
>
> I tried the median polish and it reproduces almost exactly the  
> summarised data I had, so I'm not sure this is the method they  
> employed.

Good, glad it was useful.

> I just altered your example code in one place.
>
> The example had a matrix where columns corresponded to different  
> chips (10 replicates) and rows corresponded to probes in the  
> probeset (5). One matrix per probeset.
> In your code you suggested I took:
>
> f$overall + f$row (where f=medpolish(matrix))
>
> to obtain the estimated chip effects. This summarises the 10 chips  
> into a single figure... per probe.
> What I wanted was to summarise each probeset into a single number,  
> so I take instead:
>
> f$overall + f$col

Right, my concocted example was simply the transpose of what you have.

> This gives me the summarised probeset *per chip*, reducing my table  
> of 72000ish rows (probes) x 12 chips to one of 24000ish rows  
> (probesets) x 12 chips.
>
> That is great.
>
> The obvious question is... would it be reasonable to repeat the  
> process to summarise replicate chips?
> Essentially it'll use the whole matrix, and the result could be  
> obtained as a single column (24000ish values, one per probeset) like  
> this:
>
> f$overall + f$row (as you suggested initially).
>
> probably a slow process, but does this make sense at least?

I'm not sure yet if this makes sense, but let me explain back what I  
understand of this.  You are proposing to take the whole (72000-by-12,  
probes by samples) matrix and in the end, get a 24000-by-1 (probesets  
by 1) vector.  I don't think medpolish() will work for you here, not  
as a single call at least.

The median polish model for a single probeset can be written as:

Y_{ij} = mu + row_i + col_j + error_{ij}  [1]

with constraints on row_i and col_j so that the parameter are  
identifiable ... a reparameterization of [1] gives ...

Y_{ij} = probe_i + chip_j + error_{ij}  [2]

But, maybe what you want to share the chip parameter over all samples  
(for replicate samples) and just have:

Y_{ij} = probe_i + chip + error_{ij}  [3]

If you wanted to fit everything in one model (keep in mind that this  
makes a constant variance assumption), you could run a single rlm fit  
something like:

f <- rlm(Ybig ~ probeset + probe)

where Ybig is your 72000-by-12 matrix, probeset takes on 24000  
different values and probe takes on 72000 different values (and will  
have the constraint).  I don't know if this will run in a reasonable  
amount of time.  Alternatively, you could just take a rowMeans() of  
your 24000-by-12 matrix from before and it should be close.  Or, yet  
another alternative, fit model [3] separately for each probeset (rlm?)  
and collect the "chip" estimates in a vector.

Lots of options.

Hope that helps.
Mark

> Jose
>
> -- 
> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
> The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131  
> 6513374
> Institute for Cell & Molecular Biology        Fax:   +44 (0)131  
> 6507360
> Swann Building, Mayfield Road
> University of Edinburgh
> Edinburgh EH9 3JR
> UK
>
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>

------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852