[BioC] Too many (?) differentially expressed genes - edgeR and DESeq
Wolfgang Huber
whuber at embl.de
Mon Jul 22 17:44:14 CEST 2013
Darya
one possibility is that your three replicates each at day 0 and day3/4 are not really capturing enough biological variability (i.e. are too similar) and thus even small differences between the conditions are judged as significant by edgeR / DESeq.
Best wishes
Wolfgang
On 22 Jul 2013, at 06:00, Darya Vanichkina <d.vanichkina at gmail.com> wrote:
> Hi!
>
> I'm doing a differential gene expression analysis on two time points for a human iPS cell line, using 100bp PE RNA-seq data mapped with tophat and aggregated to generate a count table using htseq-count in intersection-strict mode. While the differentiation process changes the phenotype of the cell line significantly, I'm still not sure I would expect >15% of genes in the genome to change significantly.
>
> I'm especially worried about the fact that after I filter for lowly expressed genes, ~50% of those for which testing is carried out are called significant. Naturally, I can filter based on cpm/logFC/detection by both tools before coming to a "significant" DG table, but I'm wondering if there is something intrinsically wrong with my data/samples, and if so - what? Or am I just making a very silly mistake in my code somewhere?
>
> Thanks in advance for your help,
> Darya Vanichkina
>
>
>
> edgeR BCV plot (Disp = 0.0189 , BCV = 0.1375 )
>
> http://i.imgur.com/Pf32qeBh.png
>
> edgeR smearplot
>
> http://i.imgur.com/bs6JyoEh.png
>
> DESeq MAplot
> http://i.imgur.com/Ds9nfwth.png
>
> ## Sample code used
> # edgeR -----
> row.names(counttable.sam) <- counttable.sam$Gene
> counttable.sam$Gene <- NULL
>
> y <- DGEList(counts=counttable.sam[1:6], group=rep(1:2,each=3), genes=row.names(counttable.sam))
>
> keep <- rowSums(cpm(y)>1) >= 1
> y <- y[keep,]
> dim(y)
> # 16002 6
> # Kept, out of 62890 genes in reference (gencode 17). Reads were counted to gene features using htseq-count in intersection-strict mode
>
> y$samples$lib.size <- colSums(y$counts)
> y <- calcNormFactors(y)
> plotMDS(y)
> y <- estimateCommonDisp(y, verbose=TRUE)
> # Disp = 0.0189 , BCV = 0.1375
> y <- estimateTagwiseDisp(y)
> plotBCV(y)
> et <- exactTest(y)
> summary(de <- decideTestsDGE(et))
> # Out of 16002 tags for which the counts are not too low, 6263 are upregulated and 6235 are down-regulated
> detags <- rownames(y)[as.logical(de)]
> plotSmear(et, de.tags=detags)
> abline(h=c(-1, 1), col="blue")
>
>
> # DESeq -----
> # Filter the lowly expressed tags
> rs <- rowSums (counttable.sam)
> use <- (rs > 10)
> table(use)
> # use
> # FALSE TRUE
> # 41704 21186
> counttable.sam <- counttable.sam[ use, ]
> conds <- factor(c("day0", "day0", "day0", "day34", "day34", "day34"))
>
> d <- newCountDataSet(counttable.sam, conds)
> d <- estimateSizeFactors(d)
> sizeFactors(d)
> # d0_1 d0_2 d0_3 d34_1 d34_2 d34_3
> # 2.0770099 1.4882383 0.7514035 0.4643261 1.3976540 0.6846997
> d <- estimateDispersions(d, sharingMode="maximum")
> plotDispEsts(d)
> DESeqRes <- nbinomTest(d, "day0", "day34")
> DESeqRes$GeneShort <- substr(DESeqRes$id, 1, 15)
> DESeqResSig <- subset(DESeqRes, padj < 0.01)
> # 10163/21186 genes were called significant by DESeq ...
> plotMA(DESeqRes)
>
>
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] DESeq_1.12.0 lattice_0.20-15 locfit_1.5-9.1 Biobase_2.20.1
> [5] BiocGenerics_0.6.0 ggplot2_0.9.3.1 BiocInstaller_1.10.2 edgeR_3.2.4
> [9] limma_3.16.6 biomaRt_2.16.0
>
> loaded via a namespace (and not attached):
> [1] annotate_1.38.0 AnnotationDbi_1.22.6 colorspace_1.2-2 DBI_0.2-7
> [5] dichromat_2.0-0 digest_0.6.3 genefilter_1.42.0 geneplotter_1.38.0
> [9] grid_3.0.1 gtable_0.1.2 IRanges_1.18.2 labeling_0.2
> [13] MASS_7.3-27 munsell_0.4.2 plyr_1.8 proto_0.3-10
> [17] RColorBrewer_1.0-5 RCurl_1.95-4.1 reshape2_1.2.2 RSQLite_0.11.4
> [21] scales_0.2.3 splines_3.0.1 stats4_3.0.1 stringr_0.6.2
> [25] survival_2.37-4 tools_3.0.1 XML_3.95-0.2 xtable_1.7-1
>
>
>
> ____________________
> Darya Vanichkina
> PhD Student in Bioinformatics
> IMB Science Ambassador
> Institute for Molecular Biosciences (IMB)
> University of Queensland Building 80
> St Lucia Queensland, 4072 Australia
> d.vanichkina at gmail.com
> d.vanichkina at imb.uq.edu.au
>
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list