[BioC] limma: paired analysis & effects on/interpretation of coefficients
James W. MacDonald
jmacdon at med.umich.edu
Fri Aug 26 21:44:32 CEST 2011
Hi Guido,
On 8/26/2011 2:57 PM, Hooiveld, Guido wrote:
> Hi James,
>
> Thanks for your detailed response, much appreciated!
> I had again a look at my design matrix, and indeed there was no column for the first subject (ID=Subject2), just like in the example you provided. Moreover, after checking various aspects of our data set I am able to fully reproduce in Excel the average FCs calculated by limma. :)
>
> Please allow me to ask one more question:
> In the end it turned out that the slight difference between the FCs determined by limma vs manually Excel in our *complete* dataset is due to the fact we have missing data (arrays) for 3 subjects (in total we have 96 arrays from 33 subjects sampled after 3 treatments). To be specific, all 33 subjects have data for T1 and T3, but due to bad array quality, we had to drop for 3 subjects their T2 arrays. This somehow influences the calculation of the coefficients.
> Q: do you happen to know how limma deals with missing data? In other words, how exactly are the coefficients affected by these missing arrays? Apparently the 3 subjects with missing arrays for T2 are somehow not ignored/omitted when analysing contrast T2-T1, since I found differences in the outcomes of this contrast when using as input all 96 arrays, or only the 90 arrays of the subjects for which we have all 3 arrays. Our 'local statistician' is more a SAS than limma expert, and wasn't able to answer this question.
>
Well, I haven't thought it through completely, so if Gordon pops in with
a WHAT? OMG NO WAY!, consider yourself forewarned ;-D.
Remember that the treatment effect being estimated is a 'baseline'
treatment, and for all subjects that are not the first one, you adjust
for subject-specific differences by subtracting the means of the
expression values of the first subject from the mean of the expression
values of the subject under consideration (these are the sample
coefficients).
So in the case of T2-T1, you estimate the T1 mean using data from 33
subjects, whereas for T2 you estimate the mean using data from only 30
subjects. When you re-run using only 30 subjects, you are comparing the
mean of T1 and T2, both based on only 30 subjects. Hence the slight
differences.
> As a side note, talking about SAS, do you happen to know of a BioC/CRAN library optimized for array analysis that has same functionality as SAS' PROC MIXED? We would like to apply linear mixed effects model to analyse array data from an upcoming study.
>
I don't think you will find something in R that directly corresponds to
PROC MIXED. In limma, you can fit a mixed model using the correlation
argument to lmFit (after estimating the correlation structure using
duplicateCorrelation()). There is at least one example of this in the
limma User's Guide - see p. 38. In addition, Gordon has answered any
number of questions about mixed models on the list.
Either lmer or lme4 could be used to fit mixed models, and they are
probably most comparable to proc mixed. However, either would be really
slow, and neither (last I looked) will give you p-values.
Best,
Jim
> Kind regards,
> Guido
>
> -----Original Message-----
> From: James W. MacDonald [mailto:jmacdon at med.umich.edu]
> Sent: Friday, August 26, 2011 15:22
> To: Hooiveld, Guido
> Cc: bioconductor (bioconductor at stat.math.ethz.ch)
> Subject: Re: [BioC] limma: paired analysis& effects on/interpretation of coefficients
>
> Hi Guido,
>
> On 8/25/2011 4:20 PM, Hooiveld, Guido wrote:
>> Hi,
>>
>> I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel.
>>
>> The design of the experiment is as follows:
>> - affymetrix arrays, RMA normalized
>> - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1.
>> - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment.
>> - the model includes both 'Treatment' and 'Subject'
>> - in the contrast matrix (only) 'Treatment' is defined as contrast of interest.
>> - limma runs without any problems.
>>
>> So far, so good.
>>
>> However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation.
>>
>> Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject.
>>
>> Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted?
>
> I think you are misinterpreting the coefficients in your model. You don't give the entire design matrix, but I doubt that you have enough data to fit the model the way you think you are. An example:
>
> > library(limma)
> > treat<- factor(rep(1:3, 6))
> > subj<- factor(rep(1:6, each=3))
> > design<- model.matrix(~0+treat+subj)> design
> treat1 treat2 treat3 subj2 subj3 subj4 subj5 subj6
> 1 1 0 0 0 0 0 0 0
> 2 0 1 0 0 0 0 0 0
> 3 0 0 1 0 0 0 0 0
> 4 1 0 0 1 0 0 0 0
> 5 0 1 0 1 0 0 0 0
> 6 0 0 1 1 0 0 0 0
> 7 1 0 0 0 1 0 0 0
> 8 0 1 0 0 1 0 0 0
> 9 0 0 1 0 1 0 0 0
> 10 1 0 0 0 0 1 0 0
> 11 0 1 0 0 0 1 0 0
> 12 0 0 1 0 0 1 0 0
> 13 1 0 0 0 0 0 1 0
> 14 0 1 0 0 0 0 1 0
> 15 0 0 1 0 0 0 1 0
> 16 1 0 0 0 0 0 0 1
> 17 0 1 0 0 0 0 0 1
> 18 0 0 1 0 0 0 0 1
>
> Note that there is no column for subj1. This is because the model would be over-specified if there were one, and the coefficients couldn't be estimated. Now, for any row you are computing a coefficient for whichever column has a 1. So if you look at the first row (which is subject 1 who got treatment 1), there is only one 1, for treat1.
>
> This means that treat1 is the coefficient for subj1treat1. Now jump to row 4. This is subject 2 treatment 1. We are estimating a coefficient for treat1 (subj1treat1) and subj2. Because this row corresponds to subject 2 treatment 1, this implies that the coefficient 'subj2' is estimating the difference between subj2 and subj1.
>
> This is algebraically the same as doing a paired t-test. In the paired t-test you estimate the treatment effect by subtracting within subjects, and then summing that difference. By doing so, you are eliminating the inter-subject differences.
>
> Here we are estimating the treatment effect for subj1 and subtracting off the differences between subj1 and all the other subjects. In the end we have a treatment effect that is corrected for any intrinsic differences between subjects. Essentially the same thing.
>
> Now if you compare treat1 to treat2, you get the difference between treatments, controlled for inter-subject differences, just like in the paired t-test.
>
> So long story short, you aren't getting the same estimated coefficient because you aren't summing the right things.
>
>> Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean?
>
> The coefficients for each subject are simply the difference between that subject and subject 1. This is usually considered to be a nuisance parameter, as it has no bearing on the experiment at hand, but has to be estimated because your results will be biased otherwise.
>
> Best,
>
> Jim
>
>
>>
>> Any insights are appreciated, :)
>> Guido
>>
>>> affy.data<- ReadAffy(cdfname="hugene11stv1hsentrezg")
>>> x.norm<- rma(affy.data)
>>>
>>> targets<- readTargets("targets_FL.txt")
>>> Treatment<- as.factor(targets$Treatment)
>>> Subject<- as.factor(targets$Subject)
>>>
>>> design<- model.matrix(~0+Treatment + Subject)
>>>
>>> fit1<- lmFit(x.norm, design)
>>>
>>> cont.matrix<- makeContrasts(
>> + #compareT2 vs T1 (=TreatmentT2-Treatment T1)
>> + T2vsT1=(TreatmentT2-TreatmentT1),
>> + T3avsT2=(TreatmentT3a-TreatmentT2),
>> + T3bvsT2=(TreatmentT3b-TreatmentT2),
>> + levels=design)
>>>
>>>
>>> fit2<- contrasts.fit(fit1, cont.matrix)
>>> fit2<- eBayes(fit2)
>>> topTable(fit2,coef=1)
>> ID logFC AveExpr t P.Value adj.P.Val
>> 5076 230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13
>> 2396 129804_at 0.5968911 7.007159 11.576944 2.140830e-17 2.100368e-13
>> 19071 9415_at -1.6726710 9.445871 -11.450699 3.457049e-17 2.261141e-13
>> 7193 285754_at -1.0808572 7.133550 -10.945659 2.395265e-16 1.174997e-12
>> 11167 51_at -0.5467982 9.673153 -10.853833 3.416245e-16 1.340671e-12
>> 9286 3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12
>> 13983 6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11
>> 9494 4017_at -0.5091157 9.288021 -10.198771 4.414266e-15 1.082709e-11
>> 8582 368_at -0.5483289 7.093417 -10.059173 7.658057e-15 1.669627e-11
>> 13660 595_at -0.5011958 9.767599 -9.970447 1.087936e-14 2.134748e-11
>> B
>> 5076 30.42098
>> 2396 29.09107
>> 19071 28.63062
>> 7193 26.76836
>> 11167 26.42637
>> 9286 25.04037
>> 13983 24.10246
>> 9494 23.95835
>> 8582 23.42630
>> 13660 23.08709
>>
>>
>> ###########
>> ## Output fit1
>> ###########
>>
>>> fit1<- lmFit(x.norm, design)
>>> fit1
>> An object of class "MArrayLM"
>> $coefficients
>> TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7
>> 100009676_at 7.084174 7.101001 7.029655 7.090152 -0.21206182
>> 10000_at 9.295713 9.338025 9.339492 9.359302 -0.06561657
>> 10001_at 7.604867 7.636235 7.655008 7.665815 0.02216949
>> 10002_at 4.375210 4.299118 4.393576 4.374674 -0.01616572
>> 100033413_at 4.492079 4.453446 4.413175 4.522281 -0.39793930
>> Subject12 Subject20 Subject24 Subject25 Subject26
>> 100009676_at -0.120245077 -0.051913214 0.152116448 -0.30216929 -0.0817054427
>> 10000_at 0.006550643 -0.095030864 -0.006489267 -0.13069982 0.0515787097
>> <snip>
>>
>> $design
>> TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12
>> 1 1 0 0 0 0 0
>> 2 0 0 1 0 0 0
>> 3 1 0 0 0 0 0
>> 4 0 1 0 0 0 0
>> 5 0 0 1 0 0 0
>> Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32
>> 1 0 0 0 0 0 0 0
>> 2 0 0 0 0 0 0 0
>> 3 0 0 0 0 1 0 0
>> 4 0 0 0 0 1 0 0
>> 5 0 0 0 0 1 0 0
>> Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51
>> 1 0 0 0 0 0 0 0
>> 2 0 0 0 0 0 0 0
>> 3 0 0 0 0 0 0 0
>> 4 0 0 0 0 0 0 0
>> 5 0 0 0 0 0 0 0
>> Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71
>> 1 0 0 0 0 0 1 0
>> 2 0 0 0 0 0 1 0
>> 3 0 0 0 0 0 0 0
>> 4 0 0 0 0 0 0 0
>> 5 0 0 0 0 0 0 0
>> <snip>
>>
>> #######
>> ## contrast matrix
>> #######
>>> cont.matrix
>> Contrasts
>> Levels T2vsT1 T3avsT2 T3bvsT2
>> TreatmentT1 -1 0 0
>> TreatmentT2 1 -1 -1
>> TreatmentT3a 0 1 0
>> TreatmentT3b 0 0 1
>> Subject7 0 0 0
>> Subject12 0 0 0
>> Subject20 0 0 0
>> Subject24 0 0 0
>> Subject25 0 0 0
>> Subject26 0 0 0
>> <snip>
>>
>>> sessionInfo()
>> R version 2.12.0 (2010-10-15)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> other attached packages:
>> [1] genefilter_1.32.0 annaffy_1.22.0
>> [3] KEGG.db_2.4.5 GO.db_2.4.5
>> [5] RSQLite_0.9-4 DBI_0.2-5
>> [7] AnnotationDbi_1.12.1 qvalue_1.26.0
>> [9] multtest_2.8.0 limma_3.6.9
>> [11] gcrma_2.23.0-1 hugene11stv1hsentrezgcdf_13.0.0
>> [13] affy_1.29.1 Biobase_2.10.0
>>
>> loaded via a namespace (and not attached):
>> [1] affyio_1.19.2 annotate_1.28.0 Biostrings_2.18.0
>> [4] IRanges_1.8.9 MASS_7.3-8 preprocessCore_1.12.0
>> [7] splines_2.12.0 survival_2.35-8 tcltk_2.12.0
>> [10] tools_2.12.0 xtable_1.5-6
>>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Douglas Lab
> University of Michigan
> Department of Human Genetics
> 5912 Buhl
> 1241 E. Catherine St.
> Ann Arbor MI 48109-5618
> 734-615-7826
> **********************************************************
> Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
>
>
>
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
More information about the Bioconductor
mailing list