[BioC] edgeR: paired samples DGE and GLM

Tue Mar 12 20:26:31 CET 2013

Hi,

See this previous discussion for a discussion of the pseudocounts: 
https://stat.ethz.ch/pipermail/bioconductor/2013-March/051182.html

Only the non-GLM method makes use of pseudocounts, so it's not important 
that they are missing from the GLM analysis. I believe the design rows 
and count columns *do* need to match up in the same order. To do a 
paired test, you can just use model.matrix(~0+Treatment+Patient) and 
proceed as you already have. The GLM method does give non-identical 
results, and in general it seems that people are assuming that the GLM 
method is better overall, though as always, proof is scarce when 
sequencing is so expensive.

Hope this answers your questions.

-Ryan Thompson

On 03/12/2013 04:20 AM, Alessandra [guest] wrote:
> Hi,
>
> I am trying to use edgeR to compute differential gene expression.
>
> I have a quite simple experimental design:
> labExpId Patient sex    Treatment
> sample.4  1   F            A
> sample.3  2   M           A
> sample.5  2   M           B
> sample.7  3   F            A
> sample.0  4   M           A
> sample.8  3   F            B
> sample.1  4   M           B
> sample.2  1   F            B
>
>
> I am comparing the effect of treatment across all patients, and it is fine when I apply exactTest. Then I wanted to include also the Patient in my model and I try using GLM, but I found several problems. So, I tried to apply GLM only including the Treatment just to troubleshoot. I follow the instructions on page 22 from the user's guide revised in December 2012.
>
> My counts are stored in a matrix called m.
>
>> colnames(m)
> [1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" sample.8"
> [7] "sample.0" "sample.1"
>
>> design=model.matrix(~0+Treatment)
>> design
>            TreatmentA          TreatmentB
> sample.4             1               0
> sample.3             1               0
> sample.5             0               1
> sample.7             1               0
> sample.0             1               0
> sample.8             0               1
> sample.1             0               1
> sample.2             0               1
>
>> M = DGEList(counts=na.omit(m))
>> cpm.m <- cpm(M)
>> M <- M[rowSums(cpm.m >1) >=1,]
>> M <- calcNormFactors(M)
>> M <- estimateGLMCommonDisp(M, design, verbose=T)
>> names(M)
> [1] "counts"            "samples"           "abundance"
> [4] "logCPM"            "common.dispersion"
>
> First thing I notice, I don't see pseudo.counts and other attributes I saw when calling estimateCommonDisp().
>
>> M <- estimateGLMTrendedDisp(M, design)
>> M <- estimateGLMTagwiseDisp(M, design)
>> fit <- glmFit(M, design)
>> lrt <- glmLRT(fit, contrast=c(-1,1))
>> res = topTags(lrt, n=nrow(lrt))$table
>> head(res)
>                       logFC    logCPM       LR       PValue          FDR
> ENSG00000244115  15.512665  2.266789 35.30878 2.813602e-09 3.482958e-05
> ENSG00000256329 -17.760869  4.514903 32.89653 9.719681e-09 6.015996e-05
> ENSG00000223638  11.777617 -1.467638 18.46673 1.728967e-05 7.134295e-02
> ENSG00000134321  -5.328058  5.959204 16.76318 4.234703e-05 1.310535e-01
> ENSG00000196861   7.776972 -1.722111 15.83143 6.924259e-05 1.494412e-01
> ENSG00000229292  11.368590 -1.876601 15.74621 7.243290e-05 1.494412e-01
>
>> m[rownames(head(res)),]
>                  sample.2 sample.3 sample.4 sample.5 sample.7 sample.8
> ENSG00000244115         0         0         0         0         0      6412
> ENSG00000256329         0         0         0     32070     0         0
> ENSG00000223638         0         0         2         0         0        0
> ENSG00000134321      1178       590   890     83130   754   1116
> ENSG00000196861         0         3       363       0         0        53
> ENSG00000229292         0         0         0         0         0         0
>
>                  sample.0 sample.1
> ENSG00000244115         0      4415
> ENSG00000256329         0         0
> ENSG00000223638       434       550
> ENSG00000134321       852      1092
> ENSG00000196861       500        57
> ENSG00000229292       344       406
>
> I realized that it may be a problem of the ordering of the design matrix rows
>
>> colnames(m)
> [1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" "sample.8"
> [7] "sample.0" "sample.1"
>> rownames(design)
> [1] "sample.4" "sample.3" "sample.5" "sample.7" "sample.0" "sample.8"
> [7] "sample.1" "sample.2"
>
> so I redo it forcing the row names of my design matrix to be the same order as the column names in the matrix count.
> Now my design matrix looks like:
>
>> design
>            TreatmentA TreatmentB
> sample.2             0               1
> sample.3             1               0
> sample.4             1               0
> sample.5             0               1
> sample.7             1               0
> sample.8             0               1
> sample.0             1               0
> sample.1             0               1
>
> I execute exactly the same commands as before.
> Again I cannot see the pseudocounts.
>> names(M)
> [1] "counts"            "samples"           "abundance"
> [4] "logCPM"            "common.dispersion"
>
>> head(res)
>                      logFC     logCPM       LR PValue FDR
> ENSG00000074855 -32.17749  0.2759965 2121.104      0   0
> ENSG00000076351 -32.17749 -0.4306713 1549.345      0   0
> ENSG00000105552 -32.17749 -0.4864164 1502.658      0   0
> ENSG00000106991 -32.17749  0.4622641 2144.947      0   0
> ENSG00000123009 -32.17749 -0.2422835 1719.625      0   0
> ENSG00000133216 -32.17749  0.2206127 2127.575      0   0
>
> I get very different results still from the exactTest with no GLM.
> I attribute this to the missing pseudocounts in the DGEList object?
>
>> m[rownames(head(res)),]
>                  sample.2 sample.3 sample.4 sample.5 sample.7 sample.8
> ENSG00000074855         0       948       902         0       676         0
> ENSG00000076351         0       494       522         0       538         0
> ENSG00000105552         0       490       454         0       486         0
> ENSG00000106991         0      1094       798        0       698         0
> ENSG00000123009         0       556       520         0       566         0
> ENSG00000133216         0       808       762         0       730         0
>                  sample.0 sample.1
> ENSG00000074855      1166        0
> ENSG00000076351       698         0
> ENSG00000105552       764         0
> ENSG00000106991      1733        0
> ENSG00000123009       980         0
> ENSG00000133216      1304        0
>
> I see that these genes have all zero counts for TreatmentB and this explains the low logFC, but I don't get this when I don't use GLM at all (exactTest).
>
> So in summary, my questions are:
>
> 1) Does the ordering of rows in the design matrix have to be the same as the column order in the count matrix,
> or the sample name is taken into account so the order should not matter?
>
> 2) Do you have any idea why I don't get pseudocounts after estimating the dispersion, and is it really this that causes such low logFC?
>
> I am looking forward to hearing from you.
>
> Many thanks in advance,
> Ale
>
>
>   -- output of sessionInfo():
>
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] plyr_1.7.1      optparse_1.0.1  ggplot2_0.9.3.1 reshape2_1.2.1
> [5] edgeR_3.0.8     limma_3.14.4
>
> loaded via a namespace (and not attached):
>   [1] MASS_7.3-17        RColorBrewer_1.0-5 colorspace_1.1-1   dichromat_1.2-4
>   [5] digest_0.5.2       getopt_1.19.1      grid_2.15.0        gtable_0.1.2
>   [9] labeling_0.1       munsell_0.3        proto_0.3-9.2      scales_0.2.3
> [13] stringr_0.6        tools_2.15.0
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor