[BioC] Too many (?) differentially expressed genes - edgeR and DESeq

Mon Jul 22 17:44:14 CEST 2013

Darya
one possibility is that your three replicates each at day 0 and day3/4 are not really capturing enough biological variability (i.e. are too similar) and thus even small differences between the conditions are judged as significant by edgeR / DESeq.
	Best wishes
	Wolfgang

On 22 Jul 2013, at 06:00, Darya Vanichkina <d.vanichkina at gmail.com> wrote:

> Hi! 
> 
> I'm doing a differential gene expression analysis on two time points for a human iPS cell line, using 100bp PE RNA-seq data mapped with tophat and aggregated to generate a count table using htseq-count in intersection-strict mode. While  the differentiation process changes the phenotype of the cell line significantly, I'm still not sure I would expect >15% of genes in the genome to change significantly. 
> 
> I'm especially worried about the fact that after I filter for lowly expressed genes, ~50% of those for which testing is carried out are called significant. Naturally, I can filter based on cpm/logFC/detection by both tools before coming to a "significant" DG table, but I'm wondering if there is something intrinsically wrong with my data/samples, and if so - what? Or am I just making a very silly mistake in my code somewhere?
> 
> Thanks in advance for your help,
> Darya Vanichkina
> 
> 
> 
> edgeR BCV plot (Disp = 0.0189 , BCV = 0.1375 )
> 
> http://i.imgur.com/Pf32qeBh.png
> 
> edgeR smearplot
> 
> http://i.imgur.com/bs6JyoEh.png
> 
> DESeq MAplot
> http://i.imgur.com/Ds9nfwth.png
> 
> ## Sample code used
> # edgeR -----
> row.names(counttable.sam) <- counttable.sam$Gene
> counttable.sam$Gene <- NULL
> 
> y <- DGEList(counts=counttable.sam[1:6], group=rep(1:2,each=3), genes=row.names(counttable.sam))
> 
> keep <- rowSums(cpm(y)>1) >= 1
> y <- y[keep,]
> dim(y)
> # 16002    6
> # Kept, out of 62890 genes in reference (gencode 17). Reads were counted to gene features using htseq-count in intersection-strict mode
> 
> y$samples$lib.size <- colSums(y$counts)
> y <- calcNormFactors(y)
> plotMDS(y)
> y <- estimateCommonDisp(y, verbose=TRUE)
> # Disp = 0.0189 , BCV = 0.1375 
> y <- estimateTagwiseDisp(y)
> plotBCV(y)
> et <- exactTest(y)
> summary(de <- decideTestsDGE(et))
> # Out of 16002 tags for which the counts are not too low, 6263 are upregulated and 6235 are down-regulated
> detags <- rownames(y)[as.logical(de)]
> plotSmear(et, de.tags=detags)
> abline(h=c(-1, 1), col="blue")
> 
> 
> # DESeq -----
> # Filter the lowly expressed tags
> rs <- rowSums (counttable.sam)
> use <- (rs > 10)
> table(use)
> # use
> # FALSE  TRUE 
> # 41704 21186 
> counttable.sam <- counttable.sam[ use, ]
> conds <- factor(c("day0", "day0", "day0", "day34", "day34", "day34"))
> 
> d <- newCountDataSet(counttable.sam, conds)
> d <- estimateSizeFactors(d)
> sizeFactors(d)
> #  d0_1   d0_2   d0_3  d34_1  d34_2  d34_3 
> # 2.0770099 1.4882383 0.7514035 0.4643261 1.3976540 0.6846997
> d <- estimateDispersions(d, sharingMode="maximum")
> plotDispEsts(d) 
> DESeqRes <- nbinomTest(d, "day0", "day34")
> DESeqRes$GeneShort <- substr(DESeqRes$id, 1, 15)
> DESeqResSig <- subset(DESeqRes, padj < 0.01)
> # 10163/21186 genes were called significant by DESeq ...
> plotMA(DESeqRes)
> 
> 
> 
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> 
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
> 
> other attached packages:
> [1] DESeq_1.12.0         lattice_0.20-15      locfit_1.5-9.1       Biobase_2.20.1      
> [5] BiocGenerics_0.6.0   ggplot2_0.9.3.1      BiocInstaller_1.10.2 edgeR_3.2.4         
> [9] limma_3.16.6         biomaRt_2.16.0      
> 
> loaded via a namespace (and not attached):
> [1] annotate_1.38.0      AnnotationDbi_1.22.6 colorspace_1.2-2     DBI_0.2-7           
> [5] dichromat_2.0-0      digest_0.6.3         genefilter_1.42.0    geneplotter_1.38.0  
> [9] grid_3.0.1           gtable_0.1.2         IRanges_1.18.2       labeling_0.2        
> [13] MASS_7.3-27          munsell_0.4.2        plyr_1.8             proto_0.3-10        
> [17] RColorBrewer_1.0-5   RCurl_1.95-4.1       reshape2_1.2.2       RSQLite_0.11.4      
> [21] scales_0.2.3         splines_3.0.1        stats4_3.0.1         stringr_0.6.2       
> [25] survival_2.37-4      tools_3.0.1          XML_3.95-0.2         xtable_1.7-1   
> 
> 
> 
> ____________________
> Darya Vanichkina
> PhD Student in Bioinformatics
> IMB Science Ambassador
> Institute for Molecular Biosciences (IMB)
> University of Queensland Building 80
> St Lucia Queensland, 4072 Australia
> d.vanichkina at gmail.com
> d.vanichkina at imb.uq.edu.au
> 
> 
> 
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor