[BioC] limma multiple groups comparison produces different pvalue comparing with two group comparsion

Wed Mar 14 14:29:08 CET 2012

Hi Jim,

Thanks very much for the thorough explanation. The pvalue is also smaller
in real data, indicating that multiple group
Comparison is more reasonable in general.

Best,

-- 
Chunxuan

On 3/14/12 2:12 PM, "James W. MacDonald" <jmacdon at uw.edu> wrote:

>Hi Chunxuan,
>
>
>On 3/13/2012 4:53 AM, Shao, Chunxuan wrote:
>> Hi everyone:
>>
>> I am trying limma for multiple group comparison, and found that the
>>pvalue of pairwise comparison is different from the two group comparsion.
>>
>> A simple example:
>>
>> ## multiple groups
>> sd<- 0.3*sqrt(4/rchisq(100,df=4))
>> y2<- matrix(rnorm(100*9,sd=sd),100,9)
>> rownames(y2)<- paste("Gene",1:100)
>> y2[1:2,4:6]<- y2[1:2,4:6] + 2
>> y2[1:2,7:9]<- y2[1:2,7:9] + 2
>> f<- factor(c(rep("A",3),rep("B",3),rep("C",3)),levels=c("A","B","C"))
>> design<- model.matrix(~0+f)
>> colnames(design)<- c("A","B","C")
>> fit2<- lmFit(y2,design)
>> contrast.matrix<- makeContrasts("B-A", "C-A", "C-B", levels=design)
>> fit2<- contrasts.fit(fit2, contrast.matrix)
>> fit2<- eBayes(fit2)
>> topTable(fit2, adjust="BH",coef=1)
>>
>>          ID      logFC         t      P.Value   adj.P.Val          B
>> 1   Gene 1  1.8961975  8.697056 1.038672e-05 0.001038672  3.9011567
>> 2   Gene 2  2.3140816  6.104506 1.690945e-04 0.008454727  0.9859443
>> 96 Gene 96 -0.8959120 -4.039216 2.855951e-03 0.095198377 -1.9787088
>> 49 Gene 49 -0.4983110 -2.776108 2.128499e-02 0.462011867 -4.0378662
>> 86 Gene 86  0.7288518  2.726366 2.310059e-02 0.462011867 -4.1198456
>> 74 Gene 74 -0.8456774 -2.367919 4.171492e-02 0.695248750 -4.7045985
>> 50 Gene 50 -1.5597167 -2.107329 6.396187e-02 0.783183079 -5.1177738
>> 46 Gene 46 -0.6530683 -2.028684 7.269084e-02 0.783183079 -5.2394326
>> 8   Gene 8 -0.3348446 -1.986208 7.786841e-02 0.783183079 -5.3044289
>> 89 Gene 89 -0.4574447 -1.942438 8.356956e-02 0.783183079 -5.3708387
>>
>> ## two groups
>> y<- y2[,1:6]
>> design<- cbind(Grp1=1,Grp2vs1=c(0,0,0,1,1,1))
>> options(digit=3)
>> fit<- lmFit(y,design)
>> fit<- eBayes(fit)
>> topTable(fit, adjust="BH",coef= "Grp2vs1")
>>
>>          ID      logFC         t      P.Value  adj.P.Val          B
>> 1   Gene 1  1.8961975  7.962269 0.0001316933 0.01316933  1.6748248
>> 2   Gene 2  2.3140816  6.190071 0.0005783671 0.02891836  0.1114107
>> 96 Gene 96 -0.8959120 -4.124112 0.0050946000 0.16982000 -2.2231052
>> 64 Gene 64 -0.6611682 -3.824833 0.0073401109 0.18350277 -2.6140362
>> 86 Gene 86  0.7288518  3.503703 0.0110265957 0.22053191 -3.0478852
>> 49 Gene 49 -0.4983110 -2.922543 0.0239288895 0.39881483 -3.8654279
>> 74 Gene 74 -0.8456774 -2.411705 0.0489411131 0.69915876 -4.6045844
>> 89 Gene 89 -0.4574447 -2.282469 0.0588626872 0.73578359 -4.7916531
>> 50 Gene 50 -1.5597167 -2.064728 0.0804679751 0.79599004 -5.1040241
>> 46 Gene 46 -0.6530683 -2.020227 0.0857867706 0.79599004 -5.1671760
>>
>>
>> In this case,  in the multiple group p_value is much smaller than the
>>two group.
>> Any suggestion to this ? Which is the appropriate way to do if I want
>>to do the pairwise comparison among three groups?
>
>Technically either way is appropriate, but the first case will be more
>powerful. In the second case you are doing a conventional t-test, where
>the denominator is the standard error of the mean (SEM), computed from
>the two groups under consideration. In the first case you are fitting a
>contrast in the context of a larger linear model, in which case the
>denominator is the sums of squares of error (SSE), which is an estimate
>of the intra-group variability over all groups.
>
>Both the SEM and SSE are measuring the same thing, which is the within
>group variability. Measures of variability tend to be inefficient
>statistics, which means that the accuracy of the statistic is highly
>dependent on the number of observations (and they can be quite
>inaccurate with few observations). This is the underlying premise of the
>eBayes step in limma, which uses the variability estimate over all genes
>(which will be pretty accurate) to adjust the variability estimate of an
>individual gene (which may not be accurate, especially with few samples).
>
>In your case, the SSE computed over three groups will tend to be a more
>accurate (and generally smaller) measure of the within group variability
>than the SEM computed over two groups. Since the numerator of your
>statistic is the same in both cases, computing a smaller denominator
>will increase the t-statistic, which will result in smaller p-values,
>and hence more power to detect differences.
>
>Best,
>
>Jim
>
>
>>
>> Thanks in advance.
>>
>> --
>> Chunxuan
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>>http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>-- 
>James W. MacDonald, M.S.
>Biostatistician
>University of Washington
>Environmental and Occupational Health Sciences
>4225 Roosevelt Way NE, # 100
>Seattle WA 98105-6099
>