[BioC] expresso: performing RMA on NON-Affy data?

Fri Apr 24 03:42:30 CEST 2009

Jose,

Can I add a (possibly naive) suggestion?

You say normalization is not the issue; it's the summarization of 3-8
probes per probeset for your 1-colour Nimblegen data.  So maybe you
just want to fit the log-additive RMA linear model to your data.  This
amounts to estimating a model with probe effects and chip effects for
each probeset, of which you are really only interested in the chip
effects.

Say you were able to collect your data from a single probeset into a  
matrix Y (concocted example below):

ce <- rnorm(10, mean=6)   # 10 samples
pe <- rnorm(5)            # 5 probes in probeset
# additive chip + probe effects, plus a little noise
Y <- outer(ce, pe, "+") + rnorm(length(ce)*length(pe), sd=.1)

One way to do the fits in a quick and robust way is median polish:

f <- medpolish(Y)

---------------
 > f

Median Polish Results (Dataset: "Y")

Overall: 6.949745

Row Effects:
  [1]  1.5885255  0.7841937  0.3210895 -0.9567836 -0.8557360 -0.3210895
  [7]  0.5180266 -0.4351636 -1.2578855  0.4802634

Column Effects:
[1] -2.243012e-05  6.828785e-01 -5.986373e-01  3.952830e-01 -4.283675e-01

Residuals:
              [,1]      [,2]      [,3]       [,4]       [,5]
  [1,] -0.01316487 -0.051061  0.000000  0.0031193  0.0069587
  [2,]  0.22046052  0.000000 -0.075148  0.0376293 -0.0069587
  [3,] -0.03686560  0.000000  0.074483 -0.2351209  0.1423658
  [4,] -0.06133574  0.077307  0.061807 -0.0031193 -0.0142717
  [5,]  0.00002243 -0.106866 -0.221060  0.0788912  0.1171018
  [6,] -0.00002243  0.000000  0.015280  0.1358204 -0.0110629
  [7,]  0.02006485  0.142389  0.000000 -0.0664879 -0.0125533
  [8,] -0.13400778 -0.050242  0.000000  0.1374301  0.0241635
  [9,]  0.01329945  0.013142  0.000000 -0.0784278 -0.0775479
[10,]  0.11997319  0.000000 -0.130439 -0.1982599  0.1446157

 > dim(Y)
[1] 10  5
 > f$overall + f$row  # estimated chip effects
  [1] 8.538271 7.733939 7.270835 5.992962 6.094009 6.628656 7.467772 6.514582
  [9] 5.691860 7.430009
 > ce                 # true chip effects
  [1] 8.000779 7.193683 6.778929 5.419198 5.591459 6.133149 6.963525 6.101933
  [9] 5.117349 6.905161
---------------
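
The estimates track the true chip effects but sit roughly 0.54 higher throughout. That is expected rather than a bug: median polish centres the probe (column) effects to have median zero, so median(pe) gets absorbed into the overall term, and f$overall + f$row recovers ce only up to a constant shift (which cancels in any between-sample comparison). A minimal check, rerunning the same simulation; the seed and the names est and shift are my additions, not part of the session above:

```r
set.seed(1)  # assumption: any seed will do, this only makes the run repeatable
ce <- rnorm(10, mean = 6)   # true chip effects, 10 samples
pe <- rnorm(5)              # true probe effects, 5 probes
Y  <- outer(ce, pe, "+") + rnorm(length(ce) * length(pe), sd = 0.1)

f   <- medpolish(Y, trace.iter = FALSE)
est <- f$overall + f$row    # estimated chip effects (rows are samples here)

# The difference from the truth should be nearly constant across samples
# and close to the median of the true probe effects.
shift <- est - ce
print(range(shift))
print(median(pe))
print(cor(est, ce))
```

If absolute levels matter, the same shift applies to every sample within a probeset, so it does not affect comparisons between chips.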

quick sketch ... it would be (fairly) easy to split up your table of  
log-transformed normalized probe intensities into a matrix for each  
probeset (e.g. using split() ...), robustly fit the model, extract the  
chip effects and whammo, there is your table of summarized expression  
values ... this would only be a few lines of R code and probably not  
too inefficient (?) ... this is effectively what goes on under the  
hood of affy::rma() and the like (although it uses C code and in a  
very general way that uses CDF environments).
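
As a rough sketch of those few lines (not what affy::rma() actually runs), assume the log-transformed, normalized intensities sit in a matrix X with one row per probe and one column per array, plus a vector probeset giving each row's probeset ID. X, probeset and summarise.probeset are hypothetical names, and the data here are simulated:

```r
set.seed(42)  # assumption: simulated stand-in for real Nimblegen data
n.arrays <- 6
# three probesets with 3, 5 and 8 probes respectively
probeset <- rep(c("ps1", "ps2", "ps3"), times = c(3, 5, 8))
X <- matrix(rnorm(length(probeset) * n.arrays, mean = 8),
            ncol = n.arrays,
            dimnames = list(NULL, paste0("array", seq_len(n.arrays))))

# Median-polish one probeset. Rows are probes and columns are arrays,
# so the chip effects are the *column* effects in this orientation.
summarise.probeset <- function(idx) {
  m   <- X[idx, , drop = FALSE]
  fit <- medpolish(m, trace.iter = FALSE)
  fit$overall + fit$col
}

# split() groups the row indices by probeset ID
idx.list <- split(seq_along(probeset), probeset)
expr <- t(sapply(idx.list, summarise.probeset))
colnames(expr) <- colnames(X)

print(round(expr, 2))   # one row per probeset, one column per array
```

medpolish() may warn that it did not converge within its default 10 iterations; raising maxiter is one option if that matters for your data.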

I suspect you could use the 'oligo' package to do much the same thing,
after using the 'pdInfoBuilder' package to build an annotation package
that correctly assigns probes to probesets ...

... anyways, just some thoughts.

Cheers,
Mark

On 24/04/2009, at 2:08 AM, Kasper Daniel Hansen wrote:

>
> On Apr 23, 2009, at 1:27 , J.delasHeras at ed.ac.uk wrote:
>
>> Quoting "James W. MacDonald" <jmacdon at med.umich.edu>:
>>
>>> Hi Jose,
>>>
>>> Do you want to do RMA, or just normalize? The problem with trying to
>>> wedge things into an AffyBatch is that the affy package will then try
>>> to find the cdfenv that contains the probe to probeset mappings, so by
>>> trying to leverage the AffyBatch infrastructure you will have to also
>>> come up with a fake cdf.
>>>
>>> If you don't have probes that make up a probeset, then I'm not sure the
>>> affy package will be of use here.
>>>
>>> Can you give more details?
>>>
>>> Best,
>>>
>>> Jim
>>
>> Hi Jim,
>>
>> normalisation is not an issue, it's more to do with the  
>> summarisation of probesets and something like 'Expresso' seems like  
>> a good way to do what I need (and some other things I don't need).
>>
>> I am dealing with Nimblegen arrays. Both two colour (whole genome
>> promoter arrays, with anything up to 20 probes per probeset), and
>> one colour "a la Affymetrix" (expression arrays, with anything
>> between 3 and 8 probes per probeset).
>>
>> I've been dealing with the two colour stuff just like I used to  
>> deal with my spotted cDNA arrays, using Limma. To summarise the  
>> data... I've used several approaches. Mostly I am not interested in  
>> the whole 2.7kb that each "promoter region" comprises, so I've  
>> taken subsets blah blah... Anyway, I'm happy with the results there.
>> But for the expression data, I have one channel data, just like
>> Affy data. Nimblegen provides already normalised and summarised
>> data along with the raw data, and they state they use the RMA
>> procedure, which I've come across when reading about Affy chips,
>> although I've never analysed Affy data myself.
>>
>> I'm reasonably happy with the data given to me. It looks  
>> reasonable. So I want to be able to do that myself rather than  
>> depending on their data (thus allowing me to do things a bit  
>> differently if I want to), and since the RMA-processed data looks  
>> good, I am interested in finding a way to do RMA myself.
>>
>> You're right, the problem with my trying to make an AffyBatch from  
>> non Affy data is that I'm going to have to create a cdf-like  
>> file... and probably will encounter other obstacles... that's why I  
>> thought I'd ask here, as there's people who are very familiar with  
>> that structure...
>>
>> In my naivety, it seems it should be a simple enough task... and as  
>> I'm using 4 types of arrays mostly... I'd only have to do some work  
>> to make these work and then just enjoy the ride as new experiments  
>> roll in...
>> Am I naive? ;-)
>
> It is pretty simple to do what you want, but "simple" is of course  
> in the eye of the beholder - it depends on how familiar you are with  
> the affy structures from a development point of view.
>
> I am not familiar with Nimblegen, but that might be much easier, as  
> in working out of the box.
>
> Kasper
>
>
>> I hope I clarified enough what I'm after.
>>
>> Jose
>>
>>
>>
>> -- 
>> Dr. Jose I. de las Heras                      Email: J.delasHeras at ed.ac.uk
>> The Wellcome Trust Centre for Cell Biology    Phone: +44 (0)131 6513374
>> Institute for Cell & Molecular Biology        Fax:   +44 (0)131 6507360
>> Swann Building, Mayfield Road
>> University of Edinburgh
>> Edinburgh EH9 3JR
>> UK
>>
>> -- 
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------
Mark Robinson
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852