[BioC] summarizing probe intensites before or after normalization- 1. how to do with RMA 2. Opinions?
James W. MacDonald
jmacdon at med.umich.edu
Mon Sep 11 17:37:18 CEST 2006
Hi Karl,
Please keep this on the list.
Since you have repeated the experiment with new samples, and there is
such a huge difference between the two sets, I would tend to separate
them (e.g., run RMA separately on the two batches), and then fit a mixed
effects model. You can use the GeneMetaEx package to do this. See this
post for more info:
http://article.gmane.org/gmane.science.biology.informatics.conductor/7981/match=gentleman+mixed+model
There may not be enough replication in each set to accurately estimate
the variance/covariance matrix (somebody like Robert Gentleman would be
better able to answer that question).
Alternatively, if the old arrays are consistently 'dimmer' than the new
by a relatively constant amount, you might just be able to fit a fixed
effects ANOVA with a batch effect. An example of doing that can be found
here, where they fit a 'day' effect:
http://bioinf.wehi.edu.au/marray/jsm2005/lab5/lab5.html
Best,
Jim
k. brand wrote:
> Jim,
>
> Yes, i succeeded in normalizing the RMA data with:
>
> datrma.postqnorm <- normalize(datrma)
>
> Sorry to shock you...wait till you see the un-normalized data (attached-
> "RMA no norm.jpeg"!
>
> Unfortunately the disparity is technical not biological. The low arrays
> were 2 years past expiration, i know the quality is significantly
> impaired. However good QPCR results justified my department allowing me
> to use some fresh arrays to finish the experiment.
>
> So i have half my experiment (2 replicates for three tissues) on old
> arrays, and the other half (2 replicates for three tissues) on fresh
> arrays. The data is ok, i get some pathways at the end which look ok.
> BUt i want to wring the most from this imperfect data set.
>
> And yes, it is definitely the correct RMA output. Normalizing after
> summarizing DOES however enforce equal distributions for all arrays (see
> attached "RMA then BB norm.jpeg"). Yet i still try to understand what is
> the better approach, at least in my "disparate" situation- to normalize
> before or after summarization.
>
> What do you think?
>
> Karl
>
> PS running another 2 arrays is not an option....
>
>
> on 9/11/2006 4:57 PM James W. MacDonald said the following:
>
>> Hi Karl,
>>
>> You have to convert the expression data to a data.frame in order to
>> get a boxplot for each column.
>>
>> boxplot(data.frame(exprs(datrma)))
>>
>> I am shocked at the boxplots for rma(). I have never seen RMA
>> processed data look that different, which makes me wonder what the raw
>> data look like. Also, are you sure these are RMA processed data (e.g.,
>> you didn't accidently mask the rma() processed data with other data)?
>> Unless the raw data are completely horrible, I don't know how you
>> could get such disparate results.
>>
>>
>>
>> Best,
>>
>> Jim
>>
>> k. brand wrote:
>>
>>> James,
>>>
>>> Thank you for your fast detailed response.
>>>
>>> At your suggetion i tried your suugestion- ie.,
>>>
>>> library(affyPLM)
>>> dat <- ReadAffy()
>>> datrma <- rma(dat, normalize=FALSE)
>>> datrma <- normalize.quantiles(exprs(datrma))
>>> boxplot(datrma)
>>>
>>> Find attached the boxplot output "RMA then QnormJimscript.jpeg" which
>>> indicates some missing syntax since there is only one box for the 12
>>> arrays?! In fact in my attempts to realize this endeavour i saw this
>>> ouput too. Unfortunately my lack of R knowledge prevented me getting
>>> around whatever the problem is. If you have a further suggestion im
>>> all ears.
>>>
>>> "data", the result of inheriting my teachers bad habits, is now "dat"...
>>>
>>> Further i attach a boxplot (better than histogram right- "RMA vs Qunt
>>> norm of MAS5 preproced & summed.jpg") comparing the two methods in my
>>> orignal post. When i look at the variation of the RMA processed and
>>> normed data i question how effectively can i compare these arrays
>>> with each. Especially along side the method shown which is MAS5
>>> preprocessed then quantile normalized using a colleagues script.
>>> These arrays look much more comparable, even ideally so. Why dont i
>>> just use this appraoch you may ask? A: im convinced RMA is a superior
>>> preprocessing and summarizing method...i just need to be able to
>>> reconcile the apparent variation in the final output. Or perhaps,
>>> better understand it.
>>>
>>> Your further thoughts, suggests greatly appreciated,
>>>
>>> karl
>>>
>>>
>>>
>>>
>>>
>>> on 9/11/2006 3:42 PM James W. MacDonald said the following:
>>>
>>>> k. brand wrote:
>>>>
>>>>> Dear All,
>>>>>
>>>>> I compared two normalization approaches for an experiment using
>>>>> twelve affy 430-2.0 chips. (histogram plot comparing bith methods
>>>>> forwarded on request).
>>>>>
>>>>> #1. RMA
>>>>> library(affy)
>>>>> data <- ReadAffy()
>>>>> datarma <- rma(data)
>>>>> exprs2excel(datarma, file="dataRMA.csv")
>>>>>
>>>>> Plotting histograms of the output shows arrays NOT perfectly
>>>>> aligning at the means and spreads.
>>>>>
>>>>> I used a custom script to effect a quantile normalization on MAS5
>>>>> preprocessed but unnormalized data-
>>>>>
>>>>> #2. Mas5 sans interchip normalization
>>>>> library(affy)
>>>>> data <- ReadAffy()
>>>>> datamas5sannorm <- mas5(data, normalize=FALSE)
>>>>> exprs2excel(datamas5sannorm, file="datamas5sannorm.csv")
>>>>> f.qnorm <- function(x,qinit=0.75,perc=100) {...
>>>>>
>>>>> The means and spreads of this normalization approach do align
>>>>> perfectly.
>>>>>
>>>>> THUS- summarizing probe intensites before or after normalization
>>>>> does appear to make a noticeable difference, as may be expected.
>>>>>
>>>>> My questions/requests-
>>>>>
>>>>> 1. Help to effect Bolstad normalization of the RMA preprocessed and
>>>>> summarized data. Whilst I succeed in generating unnormalized RMA
>>>>> preprocessed data with-
>>>>>
>>>>> library(affy)
>>>>> data <- ReadAffy()
>>>>> datarma <- rma(data, normalize=FALSE)
>>>>
>>>>
>>>>
>>>> Next step would be
>>>>
>>>> datarma <- normalize.quantiles(exprs(datarma))
>>>>
>>>> also note that 'data' is not a very good variable name, as you are
>>>> masking an existing function. When creating variable names it is
>>>> often enlightening to type the name first at an R prompt to see if
>>>> you get any response.
>>>>
>>>>
>>>>>
>>>>> As a result of my limited R experience, I failed in finding a
>>>>> method to effect Bolstad (quantile) normalization on this output.
>>>>>
>>>>> 2. Thoughts/comments on the benefits/caveats of normalizing before
>>>>> or after summarizing probe intensities.
>>>>
>>>>
>>>>
>>>> Normalizing after summarization for something like rma() seems
>>>> questionable to me. Since the expression values are based on fitting
>>>> a model to the PM probe values, if you don't normalize first you are
>>>> ignoring any non-biological variability which may end up biasing
>>>> your results. Using median polish for the model fit should help
>>>> protect against this, but I don't know that I would want to take
>>>> chances.
>>>>
>>>> As an aside, how far off are the histograms? Are you sure that there
>>>> is a reasonable difference? Eyeballing a histogram isn't the best
>>>> way to determine if the mean and variance are different or not. A
>>>> quick run through with some data here shows very little differences:
>>>>
>>>> > eset <- justRMA(filenames=list.celfiles()[1:10])
>>>> > apply(exprs(eset),2,summary)
>>>> Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7
>>>> Min. 4.085 4.070 4.091 4.051 4.068 4.090 4.087
>>>> 1st Qu. 5.835 5.859 5.832 5.812 5.842 5.858 5.852
>>>> Median 7.079 7.069 7.048 7.061 7.070 7.077 7.080
>>>> Mean 7.225 7.227 7.224 7.227 7.229 7.225 7.232
>>>> 3rd Qu. 8.352 8.324 8.351 8.363 8.361 8.330 8.347
>>>> Max. 14.550 14.440 14.420 14.400 14.490 14.430 14.260
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>>
>>>>> I look forward to any thoughts, advice & suggestions from users.
>>>>>
>>>>> thanks in advance,
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>> ===========================================
>>>>>
>>>>> > sessionInfo()
>>>>> Version 2.3.0 (2006-04-24)
>>>>> i386-pc-mingw32
>>>>>
>>>>> attached base packages:
>>>>> [1] "tools" "methods" "stats" "graphics" "grDevices"
>>>>> "utils"
>>>>> "datasets" "base"
>>>>>
>>>>> other attached packages:
>>>>> affy affyio Biobase
>>>>> "1.10.0" "1.0.0" "1.10.0"
>>>>>
>>>>
>>>>
>>>
>>
>>
>
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues.
More information about the Bioconductor
mailing list