[BioC] limma: paired analysis & effects on/interpretation of coefficients
James W. MacDonald
jmacdon at med.umich.edu
Fri Aug 26 15:22:14 CEST 2011
Hi Guido,
On 8/25/2011 4:20 PM, Hooiveld, Guido wrote:
> Hi,
>
> I am doing an analysis of a colleague's dataset, but I have troubles reproducing the results on fold changes he obtained by manually calculating these in Excel.
>
> The design of the experiment is as follows:
> - affymetrix arrays, RMA normalized
> - subjects got 3 treatments (T1, T2, T3), for now we focus on the differentially expressed genes between T2 and T1.
> - since multiple samples were obtained from the same subjects, a paired analysis is performed. Therefore I included 'Subject' as factor in the design matrix, in addition to Treatment.
> - the model includes both 'Treatment' and 'Subject'
> - in the contrast matrix (only) 'Treatment' is defined as contrast of interest.
> - limma runs without any problems.
>
> So far, so good.
>
> However, when comparing the mean of all individual FC, thus the coefficient of contrast 'T2vsT1' from fit2, we noticed this differs from the mean FC as manually calculated in Excel (by subtracting for each subject the T1 expression level from T2 [i.e. T2-T1]; subtract because data is on log2 scale). Example: for probeset 230_at limma coefficient (log FC) for 'T2vsT1' is -1.1007688, whereas by hand we find the mean log FC is -1.088240375. Not a large difference, but I expected it to be identical... The A value (Ave Expr) is identical between limma and manual calculation.
>
> Next i checked the average group expression values calculated by limma (coefs from object 'fit1' obtained when fitting the model; before doing the contasts). Then i noticed that the mean expression values (coefs) are different from the ones calculated by hand, and this *likely* causes the slightly differences. Moreover, i noticed a coefficient was also calculated for each Subject.
>
> Q1: why do the average expression levels differ between limma (from fit1) and excel? Are these 'adjusted' for 'Subject'. If so, why this wanted?
I think you are misinterpreting the coefficients in your model. You
don't give the entire design matrix, but I doubt that you have enough
data to fit the model the way you think you are. An example:
> library(limma)
> treat <- factor(rep(1:3, 6))
> subj <- factor(rep(1:6, each=3))
> design <- model.matrix(~0+treat+subj)
> design
treat1 treat2 treat3 subj2 subj3 subj4 subj5 subj6
1 1 0 0 0 0 0 0 0
2 0 1 0 0 0 0 0 0
3 0 0 1 0 0 0 0 0
4 1 0 0 1 0 0 0 0
5 0 1 0 1 0 0 0 0
6 0 0 1 1 0 0 0 0
7 1 0 0 0 1 0 0 0
8 0 1 0 0 1 0 0 0
9 0 0 1 0 1 0 0 0
10 1 0 0 0 0 1 0 0
11 0 1 0 0 0 1 0 0
12 0 0 1 0 0 1 0 0
13 1 0 0 0 0 0 1 0
14 0 1 0 0 0 0 1 0
15 0 0 1 0 0 0 1 0
16 1 0 0 0 0 0 0 1
17 0 1 0 0 0 0 0 1
18 0 0 1 0 0 0 0 1
Note that there is no column for subj1. This is because the model would
be over-specified if there were one, and the coefficients couldn't be
estimated. Now, for any row you are computing a coefficient for
whichever column has a 1. So if you look at the first row (which is
subject 1 who got treatment 1), there is only one 1, for treat1.
This means that treat1 is the coefficient for subj1treat1. Now jump to
row 4. This is subject 2 treatment 1. We are estimating a coefficient
for treat1 (subj1treat1) and subj2. Because this row corresponds to
subject 2 treatment 1, this implies that the coefficient 'subj2' is
estimating the difference between subj2 and subj1.
This is algebraically the same as doing a paired t-test. In the paired
t-test you estimate the treatment effect by subtracting within subjects,
and then summing that difference. By doing so, you are eliminating the
inter-subject differences.
Here we are estimating the treatment effect for subj1 and subtracting
off the differences between subj1 and all the other subjects. In the end
we have a treatment effect that is corrected for any intrinsic
differences between subjects. Essentially the same thing.
Now if you compare treat1 to treat2, you get the difference between
treatments, controlled for inter-subject differences, just like in the
paired t-test.
So long story short, you aren't getting the same estimated coefficient
because you aren't summing the right things.
> Q2: how to biologically interpret the coefficients that are returned for each subject? Is this a measure of how each of them deviate from the mean?
The coefficients for each subject are simply the difference between that
subject and subject 1. This is usually considered to be a nuisance
parameter, as it has no bearing on the experiment at hand, but has to be
estimated because your results will be biased otherwise.
Best,
Jim
>
> Any insights are appreciated, :)
> Guido
>
>> affy.data<- ReadAffy(cdfname="hugene11stv1hsentrezg")
>> x.norm<- rma(affy.data)
>>
>> targets<- readTargets("targets_FL.txt")
>> Treatment<- as.factor(targets$Treatment)
>> Subject<- as.factor(targets$Subject)
>>
>> design<- model.matrix(~0+Treatment + Subject)
>>
>> fit1<- lmFit(x.norm, design)
>>
>> cont.matrix<- makeContrasts(
> + #compareT2 vs T1 (=TreatmentT2-Treatment T1)
> + T2vsT1=(TreatmentT2-TreatmentT1),
> + T3avsT2=(TreatmentT3a-TreatmentT2),
> + T3bvsT2=(TreatmentT3b-TreatmentT2),
> + levels=design)
>>
>>
>> fit2<- contrasts.fit(fit1, cont.matrix)
>> fit2<- eBayes(fit2)
>> topTable(fit2,coef=1)
> ID logFC AveExpr t P.Value adj.P.Val
> 5076 230_at -1.1007688 10.479305 -11.944841 5.355630e-18 1.050882e-13
> 2396 129804_at 0.5968911 7.007159 11.576944 2.140830e-17 2.100368e-13
> 19071 9415_at -1.6726710 9.445871 -11.450699 3.457049e-17 2.261141e-13
> 7193 285754_at -1.0808572 7.133550 -10.945659 2.395265e-16 1.174997e-12
> 11167 51_at -0.5467982 9.673153 -10.853833 3.416245e-16 1.340671e-12
> 9286 3952_at -0.6370817 10.000592 -10.484375 1.438641e-15 4.704836e-12
> 13983 6319_at -1.0432746 12.180506 -10.236676 3.802185e-15 1.065807e-11
> 9494 4017_at -0.5091157 9.288021 -10.198771 4.414266e-15 1.082709e-11
> 8582 368_at -0.5483289 7.093417 -10.059173 7.658057e-15 1.669627e-11
> 13660 595_at -0.5011958 9.767599 -9.970447 1.087936e-14 2.134748e-11
> B
> 5076 30.42098
> 2396 29.09107
> 19071 28.63062
> 7193 26.76836
> 11167 26.42637
> 9286 25.04037
> 13983 24.10246
> 9494 23.95835
> 8582 23.42630
> 13660 23.08709
>
>
> ###########
> ## Output fit1
> ###########
>
>> fit1<- lmFit(x.norm, design)
>> fit1
> An object of class "MArrayLM"
> $coefficients
> TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7
> 100009676_at 7.084174 7.101001 7.029655 7.090152 -0.21206182
> 10000_at 9.295713 9.338025 9.339492 9.359302 -0.06561657
> 10001_at 7.604867 7.636235 7.655008 7.665815 0.02216949
> 10002_at 4.375210 4.299118 4.393576 4.374674 -0.01616572
> 100033413_at 4.492079 4.453446 4.413175 4.522281 -0.39793930
> Subject12 Subject20 Subject24 Subject25 Subject26
> 100009676_at -0.120245077 -0.051913214 0.152116448 -0.30216929 -0.0817054427
> 10000_at 0.006550643 -0.095030864 -0.006489267 -0.13069982 0.0515787097
> <snip>
>
> $design
> TreatmentT1 TreatmentT2 TreatmentT3a TreatmentT3b Subject7 Subject12
> 1 1 0 0 0 0 0
> 2 0 0 1 0 0 0
> 3 1 0 0 0 0 0
> 4 0 1 0 0 0 0
> 5 0 0 1 0 0 0
> Subject20 Subject24 Subject25 Subject26 Subject29 Subject30 Subject32
> 1 0 0 0 0 0 0 0
> 2 0 0 0 0 0 0 0
> 3 0 0 0 0 1 0 0
> 4 0 0 0 0 1 0 0
> 5 0 0 0 0 1 0 0
> Subject34 Subject40 Subject43 Subject45 Subject47 Subject48 Subject51
> 1 0 0 0 0 0 0 0
> 2 0 0 0 0 0 0 0
> 3 0 0 0 0 0 0 0
> 4 0 0 0 0 0 0 0
> 5 0 0 0 0 0 0 0
> Subject52 Subject61 Subject64 Subject65 Subject66 Subject69 Subject71
> 1 0 0 0 0 0 1 0
> 2 0 0 0 0 0 1 0
> 3 0 0 0 0 0 0 0
> 4 0 0 0 0 0 0 0
> 5 0 0 0 0 0 0 0
> <snip>
>
> #######
> ## contrast matrix
> #######
>> cont.matrix
> Contrasts
> Levels T2vsT1 T3avsT2 T3bvsT2
> TreatmentT1 -1 0 0
> TreatmentT2 1 -1 -1
> TreatmentT3a 0 1 0
> TreatmentT3b 0 0 1
> Subject7 0 0 0
> Subject12 0 0 0
> Subject20 0 0 0
> Subject24 0 0 0
> Subject25 0 0 0
> Subject26 0 0 0
> <snip>
>
>> sessionInfo()
> R version 2.12.0 (2010-10-15)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] genefilter_1.32.0 annaffy_1.22.0
> [3] KEGG.db_2.4.5 GO.db_2.4.5
> [5] RSQLite_0.9-4 DBI_0.2-5
> [7] AnnotationDbi_1.12.1 qvalue_1.26.0
> [9] multtest_2.8.0 limma_3.6.9
> [11] gcrma_2.23.0-1 hugene11stv1hsentrezgcdf_13.0.0
> [13] affy_1.29.1 Biobase_2.10.0
>
> loaded via a namespace (and not attached):
> [1] affyio_1.19.2 annotate_1.28.0 Biostrings_2.18.0
> [4] IRanges_1.8.9 MASS_7.3-8 preprocessCore_1.12.0
> [7] splines_2.12.0 survival_2.35-8 tcltk_2.12.0
> [10] tools_2.12.0 xtable_1.5-6
>>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
**********************************************************
Electronic Mail is not secure, may not be read every day, and should not be used for urgent or sensitive issues
More information about the Bioconductor
mailing list