[BioC] edgeR: paired samples DGE and GLM
Alessandra [guest]
guest at bioconductor.org
Tue Mar 12 12:20:08 CET 2013
Hi,
I am trying to use edgeR to compute differential gene expression.
I have a quite simple experimental design:
labExpId Patient sex Treatment
sample.4 1 F A
sample.3 2 M A
sample.5 2 M B
sample.7 3 F A
sample.0 4 M A
sample.8 3 F B
sample.1 4 M B
sample.2 1 F B
I am comparing the effect of treatment across all patients, and it is fine when I apply exactTest. Then I wanted to include also the Patient in my model and I try using GLM, but I found several problems. So, I tried to apply GLM only including the Treatment just to troubleshoot. I follow the instructions on page 22 from the user's guide revised in December 2012.
My counts are stored in a matrix called m.
>colnames(m)
[1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" sample.8"
[7] "sample.0" "sample.1"
>design=model.matrix(~0+Treatment)
> design
TreatmentA TreatmentB
sample.4 1 0
sample.3 1 0
sample.5 0 1
sample.7 1 0
sample.0 1 0
sample.8 0 1
sample.1 0 1
sample.2 0 1
>M = DGEList(counts=na.omit(m))
>cpm.m <- cpm(M)
>M <- M[rowSums(cpm.m >1) >=1,]
>M <- calcNormFactors(M)
>M <- estimateGLMCommonDisp(M, design, verbose=T)
> names(M)
[1] "counts" "samples" "abundance"
[4] "logCPM" "common.dispersion"
First thing I notice, I don't see pseudo.counts and other attributes I saw when calling estimateCommonDisp().
> M <- estimateGLMTrendedDisp(M, design)
> M <- estimateGLMTagwiseDisp(M, design)
> fit <- glmFit(M, design)
> lrt <- glmLRT(fit, contrast=c(-1,1))
> res = topTags(lrt, n=nrow(lrt))$table
> head(res)
logFC logCPM LR PValue FDR
ENSG00000244115 15.512665 2.266789 35.30878 2.813602e-09 3.482958e-05
ENSG00000256329 -17.760869 4.514903 32.89653 9.719681e-09 6.015996e-05
ENSG00000223638 11.777617 -1.467638 18.46673 1.728967e-05 7.134295e-02
ENSG00000134321 -5.328058 5.959204 16.76318 4.234703e-05 1.310535e-01
ENSG00000196861 7.776972 -1.722111 15.83143 6.924259e-05 1.494412e-01
ENSG00000229292 11.368590 -1.876601 15.74621 7.243290e-05 1.494412e-01
> m[rownames(head(res)),]
sample.2 sample.3 sample.4 sample.5 sample.7 sample.8
ENSG00000244115 0 0 0 0 0 6412
ENSG00000256329 0 0 0 32070 0 0
ENSG00000223638 0 0 2 0 0 0
ENSG00000134321 1178 590 890 83130 754 1116
ENSG00000196861 0 3 363 0 0 53
ENSG00000229292 0 0 0 0 0 0
sample.0 sample.1
ENSG00000244115 0 4415
ENSG00000256329 0 0
ENSG00000223638 434 550
ENSG00000134321 852 1092
ENSG00000196861 500 57
ENSG00000229292 344 406
I realized that it may be a problem of the ordering of the design matrix rows
> colnames(m)
[1] "sample.2" "sample.3" "sample.4" "sample.5" "sample.7" "sample.8"
[7] "sample.0" "sample.1"
> rownames(design)
[1] "sample.4" "sample.3" "sample.5" "sample.7" "sample.0" "sample.8"
[7] "sample.1" "sample.2"
so I redo it forcing the row names of my design matrix to be the same order as the column names in the matrix count.
Now my design matrix looks like:
> design
TreatmentA TreatmentB
sample.2 0 1
sample.3 1 0
sample.4 1 0
sample.5 0 1
sample.7 1 0
sample.8 0 1
sample.0 1 0
sample.1 0 1
I execute exactly the same commands as before.
Again I cannot see the pseudocounts.
> names(M)
[1] "counts" "samples" "abundance"
[4] "logCPM" "common.dispersion"
> head(res)
logFC logCPM LR PValue FDR
ENSG00000074855 -32.17749 0.2759965 2121.104 0 0
ENSG00000076351 -32.17749 -0.4306713 1549.345 0 0
ENSG00000105552 -32.17749 -0.4864164 1502.658 0 0
ENSG00000106991 -32.17749 0.4622641 2144.947 0 0
ENSG00000123009 -32.17749 -0.2422835 1719.625 0 0
ENSG00000133216 -32.17749 0.2206127 2127.575 0 0
I get very different results still from the exactTest with no GLM.
I attribute this to the missing pseudocounts in the DGEList object?
> m[rownames(head(res)),]
sample.2 sample.3 sample.4 sample.5 sample.7 sample.8
ENSG00000074855 0 948 902 0 676 0
ENSG00000076351 0 494 522 0 538 0
ENSG00000105552 0 490 454 0 486 0
ENSG00000106991 0 1094 798 0 698 0
ENSG00000123009 0 556 520 0 566 0
ENSG00000133216 0 808 762 0 730 0
sample.0 sample.1
ENSG00000074855 1166 0
ENSG00000076351 698 0
ENSG00000105552 764 0
ENSG00000106991 1733 0
ENSG00000123009 980 0
ENSG00000133216 1304 0
I see that these genes have all zero counts for TreatmentB and this explains the low logFC, but I don't get this when I don't use GLM at all (exactTest).
So in summary, my questions are:
1) Does the ordering of rows in the design matrix have to be the same as the column order in the count matrix,
or the sample name is taken into account so the order should not matter?
2) Do you have any idea why I don't get pseudocounts after estimating the dispersion, and is it really this that causes such low logFC?
I am looking forward to hearing from you.
Many thanks in advance,
Ale
-- output of sessionInfo():
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] C
attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] plyr_1.7.1 optparse_1.0.1 ggplot2_0.9.3.1 reshape2_1.2.1
[5] edgeR_3.0.8 limma_3.14.4
loaded via a namespace (and not attached):
[1] MASS_7.3-17 RColorBrewer_1.0-5 colorspace_1.1-1 dichromat_1.2-4
[5] digest_0.5.2 getopt_1.19.1 grid_2.15.0 gtable_0.1.2
[9] labeling_0.1 munsell_0.3 proto_0.3-9.2 scales_0.2.3
[13] stringr_0.6 tools_2.15.0
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list