[BioC] DESeq for an mRNA-seq time course
Wolfgang Huber
whuber at embl.de
Thu Jan 5 09:38:21 CET 2012
Dear Elena
thank you for the detailed report. This is perhaps not the answer you
are expecting, but perhaps would it be better to first figure out _what_
you want to achieve before discussing the _how_?
When in a dataset like yours you test against the null hypothesis of 'no
effect of time', it is pretty plausible that the tests reject very many
genes.
For such time courses, I have often found it more instructive to cluster
the response patterns (e.g. using k-means on the profiles themselves or
on the differences between treatment and control) and then, if needed,
to set up statistical tests for a specific hypothesis induced from that
data exploration.
These tools are useful in that context:
- variance stabilising transformation of the count data (see DESeq
vignette) to bring them on a scale where the clustering metric (e.g.
Euclidean) is more appropriate,
- variance filtering, to first remove genes that do not chance
appreciably at all and thus to simplify the clustering,
- heatmaps.
Hope this helps
Wolfgang
Jan/5/12 12:33 AM, Elena Sorokin scripsit::
> Hello everyone,
>
> I'm trying to figure out whether the DESeq package and its GLM function
> will work for a complex mRNA-seq time course. I have a time course of
> five time points, with treated and untreated samples at each time point,
> and multiple replicates. The preliminary results from this program make
> me sure I haven't set up the contrast correctly, because the results
> look wrong to me (more than half of the genome is differentially
> expressed at an FDR <1%). Am I missing something? I note that in the
> DESeq package vignette, the example case involves binary conditions
> (treated/untreated vs paired/single end). Will DESeq even work for a
> time course of five time points, or am I wasting my time? My script,
> with output, is below. I apologize in advance - it is quite lengthy!
>
> any help on this would be so fantastic!
> Elena
>
> > options(digits=3)
> > setwd("C:\\Rdata\\DESeq\\GLMs")
> > library(DESeq)
> >
> > # input time course data
> > countsTable<-read.delim("input2_mut_ALL.txt",header=TRUE,
> stringsAsFactors=TRUE)
> > head(countsTable)
> gene untreat.1 untreat.2 D.1h.1 D.1h.2 U.1h.1 U1h.2 D.2h.1 D.2h.2 U.2h.1
> U.2h.2 D.6h.1 D.6h.2 U.6h.1 U.6h.2 D.18h.1 D.18h.2 U.18h.1 U.18h.2
> 1 2L52.1 98 55 100 103 98 69 70 64 73 36 59 49 33 32 92 83 47 47
> 2 2RSSE.1 423 261 374 459 327 380 216 215 191 223 146 155 117 110 196
> 133 110 110
> 3 2RSSE.2 8 9 8 16 4 8 4 3 2 3 2 1 2 0 2 1 1 1
> 4 2RSSE.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 5 3R5.1 74 38 96 46 69 52 56 41 38 32 31 14 28 31 38 33 81 81
> 6 3R5.2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>
>
> > rownames(countsTable) <- countsTable$gene
> > countsTable <- countsTable [ , -1]
> >
> > # make data.frame
> > design <- read.table("C:\\Rdata\\DESeq\\puf8lip1\\GLMs\\design2.txt",
> header = T, row.names=1)
> > design<- as.data.frame(design)
> > is.data.frame(design)
> [1] TRUE
> >
> > design
> treat.type time.hr
> untreat.1 none 0
> untreat.2 none 0
> D.1h.1 vehicle 1
> D.1h.2 vehicle 1
> U.1h.1 drug 1
> U.1h.2 drug 1
> D.2h.2 vehicle 2
> D.2h.1 vehicle 2
> U.2h.1 drug 2
> U.2h.2 drug 2
> D.6h.1 vehicle 6
> D.6h.2 vehicle 6
> U.6h.1 drug 6
> U.6h.2 drug 6
> D.18h.1 vehicle 18
> D.18h.2 vehicle 18
> U.18h.1 drug 18
> U.18h.2 drug 18
> >
> > # create a count data object
> > cds <- newCountDataSet( countsTable, design)
> > head(counts(cds))
> untreat.1 untreat.2 D.1h.1 D.1h.2 U.1h.1 U1h.2 D.2h.1 D.2h.2 U.2h.1
> U.2h.2 D.6h.1 D.6h.2 U.6h.1 U.6h.2 D.18h.1 D.18h.2 U.18h.1 U.18h.2
> 2L52.1 98 55 100 103 98 69 70 64 73 36 59 49 33 32 92 83 47 47
> 2RSSE.1 423 261 374 459 327 380 216 215 191 223 146 155 117 110 196 133
> 110 110
> 2RSSE.2 8 9 8 16 4 8 4 3 2 3 2 1 2 0 2 1 1 1
> 2RSSE.3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> 3R5.1 74 38 96 46 69 52 56 41 38 32 31 14 28 31 38 33 81 81
> 3R5.2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> >
> > # estimate library size of count data set
> > cds <- estimateSizeFactors ( cds )
> > sizeFactors( cds)
> untreat.1 untreat.2 D.1h.1 D.1h.2 U.1h.1 U1h.2 D.2h.1 D.2h.2 U.2h.1
> U.2h.2 D.6h.1 D.6h.2 U.6h.1 U.6h.2 D.18h.1 D.18h.2
> 2.307 1.149 1.435 1.301 1.370 1.230 1.152 1.040 1.079 0.926 0.602 0.593
> 0.620 0.596 0.926 0.694
> U.18h.1 U.18h.2
> 0.887 0.887
> >
> > # estimate dispersion (biological variation)
> > # this is done for each condition/factor
> > cds <- estimateDispersions( cds , method = "pooled")
> >
> > ### Check the fit ######
> >
> > plotDispEsts <- function( cds )
> + {
> + plot(
> + rowMeans( counts( cds, normalized=TRUE ) ),
> + fitInfo(cds)$perGeneDispEsts,
> + pch = 8, log="xy",xlab = "Per Gene Average Expression Level (in
> counts)", ylab= "Per Gene Variance Estimate")
> + xg <- 10^seq( -.5, 5, length.out=300 )
> + lines( xg, fitInfo(cds)$dispFun( xg ), col="red", pch=16 )
> + }
> >
> > plotDispEsts(cds)
> Warning messages:
> 1: In xy.coords(x, y, xlabel, ylabel, log) :
> 6941 x values <= 0 omitted from logarithmic plot
> 2: In xy.coords(x, y, xlabel, ylabel, log) :
> 6968 y values <= 0 omitted from logarithmic plot
> > head(fData(cds)) #dispersion values are stored in the feature data
> slot of cds
> disp_pooled
> 2L52.1 0.0245
> 2RSSE.1 0.0192
> 2RSSE.2 0.3018
> 2RSSE.3 Inf
> 3R5.1 0.0308
> 3R5.2 Inf
> > str(fitInfo(cds)) #verify that the displ_pooled
> List of 5
> $ perGeneDispEsts: num [1:27924] 0.00641 0.0192 0.10196 NaN 0.02612 ...
> $ dispFunc :function (q)
> ..- attr(*, "coefficients")= Named num [1:2] 0.00894 1.04347
> .. ..- attr(*, "names")= chr [1:2] "asymptDisp" "extraPois"
> ..- attr(*, "fitType")= chr "parametric"
> $ fittedDispEsts : num [1:27924] 0.0245 0.0137 0.3018 Inf 0.0308 ...
> $ df : int 9
> $ sharingMode : chr "maximum"
> >
> > ###### Fit data to Gen. Lin. Model #####
> >
> > fit1 <- fitNbinomGLMs(cds, count ~ treat.type + time.hr)
> ..................
> There were 50 or more warnings (use warnings() to see the first 50)
> > fit0 <- fitNbinomGLMs(cds, count ~ treat.type)
> ..................
> Warning messages:
> 1: glm.fit: algorithm did not converge
> 2: glm.fit: algorithm did not converge
> 3: glm.fit: algorithm did not converge
> 4: glm.fit: algorithm did not converge
> 5: glm.fit: algorithm did not converge
> >
> > ###DIFF EXPRESSION ##########
> >
> > str(fit1)
> 'data.frame': 27924 obs. of 6 variables:
> $ (Intercept) : num 6.24 8.16 2.67 NA 5.28 ...
> $ treat.typenone : num -0.757 -0.484 -0.225 NA -0.255 ...
> $ treat.typeU0126: num -0.532 -0.295 -0.713 NA 0.226 ...
> $ time.hr : num 0.0171 -0.0392 -0.1143 NA 0.035 ...
> $ deviance : num 13.95 22.92 8.29 NA 23.29 ...
> $ converged : logi TRUE TRUE TRUE NA TRUE NA ...
> - attr(*, "df.residual")= Named num 14
> ..- attr(*, "names")= chr "2L52.1"
> > pvalsGLM <- nbinomGLMTest (fit1, fit0)
> > padjGLM <- p.adjust (pvalsGLM, method = "BH") # correct for mult.test
> > head(fit1) # hope to see fully-converged, ie TRUE
> (Intercept) treat.typenone treat.typeU0126 time.hr deviance converged
> 2L52.1 6.24 -0.757 -0.532 0.0171 13.95 TRUE
> 2RSSE.1 8.16 -0.484 -0.295 -0.0392 22.92 TRUE
> 2RSSE.2 2.67 -0.225 -0.713 -0.1143 8.29 TRUE
> 2RSSE.3 NA NA NA NA NA NA
> 3R5.1 5.28 -0.255 0.226 0.0350 23.29 TRUE
> 3R5.2 NA NA NA NA NA NA
> > head(padjGLM)
> [1] 1.73e-01 1.39e-05 3.66e-02 NA 6.29e-03 NA
> >
> > hist(pvalsGLM, breaks=100, col="skyblue", border="slateblue", main="")
> >
> >
> > ######### Filter for significant genes at p < 0.05 ########
> > res <- cbind(fit1,pvalsGLM,padjGLM)
> > resSig <- res[ res$padjGLM < 0.05, ]
> >
> > ######### LAST PART: WRITE RESULTS TO .CSV FILE ###########
> > write.table(resSig, file="DESeqOutput.txt", quote=FALSE)
> >
> > ############
> > sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
> States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
> [5] LC_TIME=English_United States.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] pasilla_0.2.10 DESeq_1.6.1 locfit_1.5-6 lattice_0.20-0 akima_0.5-4
> Biobase_2.14.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.32.0 AnnotationDbi_1.16.10 DBI_0.2-5 genefilter_1.36.0
> geneplotter_1.32.1 grid_2.14.0 IRanges_1.12.5
> [8] RColorBrewer_1.0-5 RSQLite_0.11.1 splines_2.14.0 survival_2.36-10
> tools_2.14.0 xtable_1.6-0
> >
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Best wishes
Wolfgang
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list