[BioC] DESeq for an mRNA-seq time course
Elena Sorokin
sorokin at wisc.edu
Thu Jan 5 00:33:10 CET 2012
Hello everyone,
I'm trying to figure out whether the DESeq package and its GLM function
will work for a complex mRNA-seq time course. I have a time course of
five time points, with treated and untreated samples at each time point,
and multiple replicates. The preliminary results from this program make
me sure I haven't set up the contrast correctly, because the results
look wrong to me (more than half of the genome is differentially
expressed at an FDR <1%). Am I missing something? I note that in the
DESeq package vignette, the example case involves binary conditions
(treated/untreated vs paired/single end). Will DESeq even work for a
time course of five time points, or am I wasting my time? My script,
with output, is below. I apologize in advance - it is quite lengthy!
any help on this would be so fantastic!
Elena
> options(digits=3)
> setwd("C:\\Rdata\\DESeq\\GLMs")
> library(DESeq)
>
> # input time course data
> countsTable<-read.delim("input2_mut_ALL.txt",header=TRUE,
stringsAsFactors=TRUE)
> head(countsTable)
gene untreat.1 untreat.2 D.1h.1 D.1h.2 U.1h.1 U1h.2 D.2h.1 D.2h.2
U.2h.1 U.2h.2 D.6h.1 D.6h.2 U.6h.1 U.6h.2 D.18h.1 D.18h.2 U.18h.1 U.18h.2
1 2L52.1 98 55 100 103 98 69 70
64 73 36 59 49 33 32 92 83 47 47
2 2RSSE.1 423 261 374 459 327 380 216
215 191 223 146 155 117 110 196 133
110 110
3 2RSSE.2 8 9 8 16 4 8 4
3 2 3 2 1 2 0 2 1 1 1
4 2RSSE.3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
5 3R5.1 74 38 96 46 69 52 56
41 38 32 31 14 28 31 38 33 81 81
6 3R5.2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
> rownames(countsTable) <- countsTable$gene
> countsTable <- countsTable [ , -1]
>
> # make data.frame
> design <- read.table("C:\\Rdata\\DESeq\\puf8lip1\\GLMs\\design2.txt",
header = T, row.names=1)
> design<- as.data.frame(design)
> is.data.frame(design)
[1] TRUE
>
> design
treat.type time.hr
untreat.1 none 0
untreat.2 none 0
D.1h.1 vehicle 1
D.1h.2 vehicle 1
U.1h.1 drug 1
U.1h.2 drug 1
D.2h.2 vehicle 2
D.2h.1 vehicle 2
U.2h.1 drug 2
U.2h.2 drug 2
D.6h.1 vehicle 6
D.6h.2 vehicle 6
U.6h.1 drug 6
U.6h.2 drug 6
D.18h.1 vehicle 18
D.18h.2 vehicle 18
U.18h.1 drug 18
U.18h.2 drug 18
>
> # create a count data object
> cds <- newCountDataSet( countsTable, design)
> head(counts(cds))
untreat.1 untreat.2 D.1h.1 D.1h.2 U.1h.1 U1h.2 D.2h.1 D.2h.2
U.2h.1 U.2h.2 D.6h.1 D.6h.2 U.6h.1 U.6h.2 D.18h.1 D.18h.2 U.18h.1 U.18h.2
2L52.1 98 55 100 103 98 69 70 64
73 36 59 49 33 32 92 83 47 47
2RSSE.1 423 261 374 459 327 380 216 215
191 223 146 155 117 110 196 133 110 110
2RSSE.2 8 9 8 16 4 8 4
3 2 3 2 1 2 0 2 1 1 1
2RSSE.3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
3R5.1 74 38 96 46 69 52 56 41
38 32 31 14 28 31 38 33 81 81
3R5.2 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
>
> # estimate library size of count data set
> cds <- estimateSizeFactors ( cds )
> sizeFactors( cds)
untreat.1 untreat.2 D.1h.1 D.1h.2 U.1h.1 U1h.2 D.2h.1
D.2h.2 U.2h.1 U.2h.2 D.6h.1 D.6h.2 U.6h.1 U.6h.2
D.18h.1 D.18h.2
2.307 1.149 1.435 1.301 1.370 1.230
1.152 1.040 1.079 0.926 0.602 0.593 0.620
0.596 0.926 0.694
U.18h.1 U.18h.2
0.887 0.887
>
> # estimate dispersion (biological variation)
> # this is done for each condition/factor
> cds <- estimateDispersions( cds , method = "pooled")
>
> ### Check the fit ######
>
> plotDispEsts <- function( cds )
+ {
+ plot(
+ rowMeans( counts( cds, normalized=TRUE ) ),
+ fitInfo(cds)$perGeneDispEsts,
+ pch = 8, log="xy",xlab = "Per Gene Average Expression Level (in
counts)", ylab= "Per Gene Variance Estimate")
+ xg <- 10^seq( -.5, 5, length.out=300 )
+ lines( xg, fitInfo(cds)$dispFun( xg ), col="red", pch=16 )
+ }
>
> plotDispEsts(cds)
Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) :
6941 x values <= 0 omitted from logarithmic plot
2: In xy.coords(x, y, xlabel, ylabel, log) :
6968 y values <= 0 omitted from logarithmic plot
> head(fData(cds)) #dispersion values are stored in the feature data
slot of cds
disp_pooled
2L52.1 0.0245
2RSSE.1 0.0192
2RSSE.2 0.3018
2RSSE.3 Inf
3R5.1 0.0308
3R5.2 Inf
> str(fitInfo(cds)) #verify that the displ_pooled
List of 5
$ perGeneDispEsts: num [1:27924] 0.00641 0.0192 0.10196 NaN 0.02612 ...
$ dispFunc :function (q)
..- attr(*, "coefficients")= Named num [1:2] 0.00894 1.04347
.. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
..- attr(*, "fitType")= chr "parametric"
$ fittedDispEsts : num [1:27924] 0.0245 0.0137 0.3018 Inf 0.0308 ...
$ df : int 9
$ sharingMode : chr "maximum"
>
> ###### Fit data to Gen. Lin. Model #####
>
> fit1 <- fitNbinomGLMs(cds, count ~ treat.type + time.hr)
..................
There were 50 or more warnings (use warnings() to see the first 50)
> fit0 <- fitNbinomGLMs(cds, count ~ treat.type)
..................
Warning messages:
1: glm.fit: algorithm did not converge
2: glm.fit: algorithm did not converge
3: glm.fit: algorithm did not converge
4: glm.fit: algorithm did not converge
5: glm.fit: algorithm did not converge
>
> ###DIFF EXPRESSION ##########
>
> str(fit1)
'data.frame': 27924 obs. of 6 variables:
$ (Intercept) : num 6.24 8.16 2.67 NA 5.28 ...
$ treat.typenone : num -0.757 -0.484 -0.225 NA -0.255 ...
$ treat.typeU0126: num -0.532 -0.295 -0.713 NA 0.226 ...
$ time.hr : num 0.0171 -0.0392 -0.1143 NA 0.035 ...
$ deviance : num 13.95 22.92 8.29 NA 23.29 ...
$ converged : logi TRUE TRUE TRUE NA TRUE NA ...
- attr(*, "df.residual")= Named num 14
..- attr(*, "names")= chr "2L52.1"
> pvalsGLM <- nbinomGLMTest (fit1, fit0)
> padjGLM <- p.adjust (pvalsGLM, method = "BH") # correct for mult.test
> head(fit1) # hope to see fully-converged, ie TRUE
(Intercept) treat.typenone treat.typeU0126 time.hr deviance
converged
2L52.1 6.24 -0.757 -0.532 0.0171 13.95
TRUE
2RSSE.1 8.16 -0.484 -0.295 -0.0392 22.92
TRUE
2RSSE.2 2.67 -0.225 -0.713 -0.1143 8.29
TRUE
2RSSE.3 NA NA NA NA
NA NA
3R5.1 5.28 -0.255 0.226 0.0350 23.29
TRUE
3R5.2 NA NA NA NA
NA NA
> head(padjGLM)
[1] 1.73e-01 1.39e-05 3.66e-02 NA 6.29e-03 NA
>
> hist(pvalsGLM, breaks=100, col="skyblue", border="slateblue", main="")
>
>
> ######### Filter for significant genes at p < 0.05 ########
> res <- cbind(fit1,pvalsGLM,padjGLM)
> resSig <- res[ res$padjGLM < 0.05, ]
>
> ######### LAST PART: WRITE RESULTS TO .CSV FILE ###########
> write.table(resSig, file="DESeqOutput.txt", quote=FALSE)
>
> ############
> sessionInfo()
R version 2.14.0 (2011-10-31)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] pasilla_0.2.10 DESeq_1.6.1 locfit_1.5-6 lattice_0.20-0
akima_0.5-4 Biobase_2.14.0
loaded via a namespace (and not attached):
[1] annotate_1.32.0 AnnotationDbi_1.16.10 DBI_0.2-5
genefilter_1.36.0 geneplotter_1.32.1 grid_2.14.0
IRanges_1.12.5
[8] RColorBrewer_1.0-5 RSQLite_0.11.1 splines_2.14.0
survival_2.36-10 tools_2.14.0 xtable_1.6-0
>
More information about the Bioconductor
mailing list