[BioC] summarizing probe intensites before or after normalization- 1. how to do with RMA 2. Opinions?

Mon Sep 11 17:37:18 CEST 2006

Hi Karl,

Please keep this on the list.

Since you have repeated the experiment with new samples, and there is 
such a huge difference between the two sets, I would tend to separate 
them (e.g., run RMA separately on the two batches), and then fit a mixed 
effects model. You can use the GeneMetaEx package to do this. See this 
post for more info:

http://article.gmane.org/gmane.science.biology.informatics.conductor/7981/match=gentleman+mixed+model

There may not be enough replication in each set to accurately estimate 
the variance/covariance matrix (somebody like Robert Gentleman would be 
better able to answer that question).

Alternatively, if the old arrays are consistently 'dimmer' than the new 
by a relatively constant amount, you might just be able to fit a fixed 
effects ANOVA with a batch effect. An example of doing that can be found 
here, where they fit a 'day' effect:

http://bioinf.wehi.edu.au/marray/jsm2005/lab5/lab5.html

Best,

Jim

k. brand wrote:
> Jim,
> 
> Yes, i succeeded in normalizing the RMA data with:
> 
> datrma.postqnorm <- normalize(datrma)
> 
> Sorry to shock you...wait till you see the un-normalized data (attached- 
> "RMA no norm.jpeg"!
> 
> Unfortunately the disparity is technical not biological. The low arrays 
> were 2 years past expiration, i know the quality is significantly 
> impaired. However good QPCR results justified my department allowing me 
> to use some fresh arrays to finish the experiment.
> 
> So i have half my experiment (2 replicates for three tissues) on old 
> arrays, and the other half (2 replicates for three tissues) on fresh 
> arrays. The data is ok, i get some pathways at the end which look ok. 
> BUt i want to wring the most from this imperfect data set.
> 
> And yes, it is definitely the correct RMA output. Normalizing after 
> summarizing DOES however enforce equal distributions for all arrays (see 
> attached "RMA then BB norm.jpeg"). Yet i still try to understand what is 
> the better approach, at least in my "disparate" situation- to normalize 
> before or after summarization.
> 
> What do you think?
> 
> Karl
> 
> PS running another 2 arrays is not an option....
> 
> 
> on 9/11/2006 4:57 PM James W. MacDonald said the following:
> 
>> Hi Karl,
>>
>> You have to convert the expression data to a data.frame in order to 
>> get a boxplot for each column.
>>
>> boxplot(data.frame(exprs(datrma)))
>>
>> I am shocked at the boxplots for rma(). I have never seen RMA 
>> processed data look that different, which makes me wonder what the raw 
>> data look like. Also, are you sure these are RMA processed data (e.g., 
>> you didn't accidently mask the rma() processed data with other data)? 
>> Unless the raw data are completely horrible, I don't know how you 
>> could get such disparate results.
>>
>>
>>
>> Best,
>>
>> Jim
>>
>> k. brand wrote:
>>
>>> James,
>>>
>>> Thank you for your fast detailed response.
>>>
>>> At your suggetion i tried your suugestion- ie.,
>>>
>>> library(affyPLM)
>>> dat <- ReadAffy()
>>> datrma <- rma(dat, normalize=FALSE)
>>> datrma <- normalize.quantiles(exprs(datrma))
>>> boxplot(datrma)
>>>
>>> Find attached the boxplot output "RMA then QnormJimscript.jpeg" which 
>>> indicates some missing syntax since there is only one box for the 12 
>>> arrays?! In fact in my attempts to realize this endeavour i saw this 
>>> ouput too. Unfortunately my lack of R knowledge prevented me getting 
>>> around whatever the problem is. If you have a further suggestion im 
>>> all ears.
>>>
>>> "data", the result of inheriting my teachers bad habits, is now "dat"...
>>>
>>> Further i attach a boxplot (better than histogram right- "RMA vs Qunt 
>>> norm of MAS5 preproced & summed.jpg") comparing the two methods in my 
>>> orignal post. When i look at the variation of the RMA processed and 
>>> normed data i question how effectively can i compare these arrays 
>>> with each. Especially along side the method shown which is MAS5 
>>> preprocessed then quantile normalized using a colleagues script. 
>>> These arrays look much more comparable, even ideally so. Why dont i 
>>> just use this appraoch you may ask? A: im convinced RMA is a superior 
>>> preprocessing and summarizing method...i just need to be able to 
>>> reconcile the apparent variation in the final output. Or perhaps, 
>>> better understand it.
>>>
>>> Your further thoughts, suggests greatly appreciated,
>>>
>>> karl
>>>
>>>
>>>
>>>
>>>
>>> on 9/11/2006 3:42 PM James W. MacDonald said the following:
>>>
>>>> k. brand wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I compared two normalization approaches for an experiment using 
>>>>> twelve affy 430-2.0 chips. (histogram plot comparing bith methods 
>>>>> forwarded on request).
>>>>>
>>>>> #1. RMA
>>>>> library(affy)
>>>>> data <- ReadAffy()
>>>>> datarma <- rma(data)
>>>>> exprs2excel(datarma, file="dataRMA.csv")
>>>>>
>>>>> Plotting histograms of the output shows arrays NOT perfectly 
>>>>> aligning at the means and spreads.
>>>>>
>>>>> I used a custom script to effect a quantile normalization on MAS5 
>>>>> preprocessed but unnormalized data-
>>>>>
>>>>> #2. Mas5 sans interchip normalization
>>>>> library(affy)
>>>>> data <- ReadAffy()
>>>>> datamas5sannorm <- mas5(data, normalize=FALSE)
>>>>> exprs2excel(datamas5sannorm, file="datamas5sannorm.csv")
>>>>> f.qnorm <- function(x,qinit=0.75,perc=100)  {...
>>>>>
>>>>> The means and spreads of this normalization approach do align 
>>>>> perfectly.
>>>>>
>>>>> THUS- summarizing probe intensites before or after normalization 
>>>>> does appear to make a noticeable difference, as may be expected.
>>>>>
>>>>> My questions/requests-
>>>>>
>>>>> 1. Help to effect Bolstad normalization of the RMA preprocessed and 
>>>>> summarized data. Whilst I succeed in generating unnormalized RMA 
>>>>> preprocessed data with-
>>>>>
>>>>> library(affy)
>>>>> data <- ReadAffy()
>>>>> datarma <- rma(data, normalize=FALSE)
>>>>
>>>>
>>>>
>>>> Next step would be
>>>>
>>>> datarma <- normalize.quantiles(exprs(datarma))
>>>>
>>>> also note that 'data' is not a very good variable name, as you are 
>>>> masking an existing function. When creating variable names it is 
>>>> often enlightening to type the name first at an R prompt to see if 
>>>> you get any response.
>>>>
>>>>
>>>>>
>>>>> As a result of my limited R experience, I failed in finding a 
>>>>> method to effect Bolstad (quantile) normalization on this output.
>>>>>
>>>>> 2. Thoughts/comments on the benefits/caveats of normalizing before 
>>>>> or after summarizing probe intensities.
>>>>
>>>>
>>>>
>>>> Normalizing after summarization for something like rma() seems 
>>>> questionable to me. Since the expression values are based on fitting 
>>>> a model to the PM probe values, if you don't normalize first you are 
>>>> ignoring any non-biological variability which may end up biasing 
>>>> your results. Using median polish for the model fit should help 
>>>> protect against this, but I don't know that I would want to take 
>>>> chances.
>>>>
>>>> As an aside, how far off are the histograms? Are you sure that there 
>>>> is a reasonable difference? Eyeballing a histogram isn't the best 
>>>> way to determine if the mean and variance are different or not. A 
>>>> quick run through with some data here shows very little differences:
>>>>
>>>>  > eset <- justRMA(filenames=list.celfiles()[1:10])
>>>>  > apply(exprs(eset),2,summary)
>>>>         Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7
>>>> Min.       4.085    4.070    4.091    4.051    4.068    4.090    4.087
>>>> 1st Qu.    5.835    5.859    5.832    5.812    5.842    5.858    5.852
>>>> Median     7.079    7.069    7.048    7.061    7.070    7.077    7.080
>>>> Mean       7.225    7.227    7.224    7.227    7.229    7.225    7.232
>>>> 3rd Qu.    8.352    8.324    8.351    8.363    8.361    8.330    8.347
>>>> Max.      14.550   14.440   14.420   14.400   14.490   14.430   14.260
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>>
>>>>> I look forward to any thoughts, advice & suggestions from users.
>>>>>
>>>>> thanks in advance,
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> ===========================================
>>>>>
>>>>>    > sessionInfo()
>>>>> Version 2.3.0 (2006-04-24)
>>>>> i386-pc-mingw32
>>>>>
>>>>> attached base packages:
>>>>> [1] "tools"     "methods"   "stats"     "graphics"  "grDevices" 
>>>>> "utils"
>>>>>        "datasets"  "base"
>>>>>
>>>>> other attached packages:
>>>>>        affy   affyio  Biobase
>>>>> "1.10.0"  "1.0.0" "1.10.0"
>>>>>
>>>>
>>>>
>>>
>>
>>
> 

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623

**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.