[BioC] Low number of replicates DESeq
Steve Lianoglou
lianoglou.steve at gene.com
Wed Feb 26 09:31:16 CET 2014
Hi,
In the future, you don't have to include all of the code you've ever
written :-) For instance, you have code to write tables, do DGE, do a
mean vs. sd plot, etc. but it's not really germane to this
conversation.
Also, your attachments didn't make it through to the list, but since I
was cc'd on the email, I got them in my inbox.
You are providing diagnostic plots (PCA and heatmap) from a
mult-factorial design, but you should note that the fact that you
encode the library type in the design will not change the output of
these plots. That is to say that the "libType" hasn't been corrected
for in the variance stabilized data. It is only accounted for in your
results form the DGE analysis.
Anyway: the PCA plot from the vst transformed data (your DESeq1 pca
plot) seems to suggest that your conditions are lining up pretty
reasonably along PC2 (vertically), while PC1 simply separates your
samples by libType -- would you agree with that assessment? (Forget
the pca from the rlog transform -- it seems the VST is doing better
here).
If this is the case, then encoding the libType as a main effect in
your model (as you've done) should go a long ways in dealing with this
issue for you.
How can you tell?
One way is to do a DGE analysis with just the unstranded data, then do
it again with the unstranded data AND stranded data using a design
matrix with libType + condition, and just test for condition. Are the
results between the two comparable when you look at the same
contrasts?
HTH,
-steve
On Tue, Feb 25, 2014 at 12:42 AM, Federico Gaiti <f.gaiti at uq.edu.au> wrote:
> Hi Steve,
>
> Please see my comments below
>
> Thanks
> ________________________________________
>
>>Thanks for pasting the code now -- without the images or results from
>>your output, it's hard to see what is reasonably correlated or not,
>>but let's continue ...
>
>>I see you're already doing PCA, and it would be interesting to see how
>>your results compare to what you get by following along with the
>>DESeq2 vignette and shooting your data through one of the variance
>>stabilizing transforms (incidentally, the DESeq vignette also has a
>>similar section, and it would have been worth while to browse that
>>vignette and follow the advice outlined there as well)
>
> This is exactly what I've done. I started doing a simple correlation anylsis and then shot my data first through DESeq and then through DESeq2.
> You can find atttached some PCA plots and heatmaps when I used the option -s no for the stranded data
>
> In DESeq2:
>
> CountTable=read.table("DGE.txt",header=TRUE,row.names=1)
> samples<-data.frame(row.names=c("ADULT","ADULT1","ADULT2","ADULT3","JUV","JUV1","JUV2","JUV3","COMP","COMP1","COMP2","COMP3","PRECOMP","PRECOMP1","PRECOMP2","PRECOMP3"),condition=as.factor(c(rep("ADULT",4),rep("JUV",4),rep("COMP",4),rep("PRECOMP",4))))
> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=samples,design=~condition)
> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
> dds<-DESeq(dds)
> rld <- rlogTransformation(dds, blind=TRUE)
> rlogMat <- assay(rld)
> vstMat <- assay(vsd)
> vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
> px <- counts(dds)[,1] / sizeFactors(dds)[1]
> ord <- order(px)
> ord <- ord[px[ord] < 150]
> ord <- ord[seq(1, length(ord), length=50)]
> last <- ord[length(ord)]
> vstcol <- c("blue", "black")
> matplot(px[ord],cbind(assay(vsd)[, 1], log2(px))[ord, ],type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)")
> legend("bottomright",legend = c(expression("variance stabilizing transformation"),expression(log[2](n/s[1]))),fill=vstcol)
> library("vsn")
> par(mfrow=c(1,3))
> notAllZero <- (rowSums(counts(dds))>0)
> meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1),ylim = c(0,2.5))
> meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
> meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))
> library("RColorBrewer")
> library("gplots")
> select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100]
> hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10,6))
> heatmap.2(assay(rld)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6))
> heatmap.2(assay(vsd)[select,], col = hmcol,Rowv = FALSE, Colv = FALSE, scale="none",dendrogram="none", trace="none", margin=c(10, 6))
> distsRL <- dist(t(assay(rld)))
> mat <- as.matrix(distsRL)
> rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition))
> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
> print(plotPCA(rld, intgroup=c("condition")))
>
> In DESeq:
>
> data<-read.table("Counts",header=TRUE,row.names=1)
> Design=data.frame(row.names=colnames(data),condition=c('1','1','1','1','2','2','2','2','3','3','3','3','4','4','4','4'))
> cds=newCountDataSet(data,Design)
> cds=estimateSizeFactors(cds)
> cdsBlind=estimateDispersions(cds,method='blind')
> vsd=varianceStabilizingTransformation(cdsBlind)
> library("RColorBrewer")
> library("gplots")
> select = order(rowMeans(counts(cdsBlind)), decreasing=TRUE)[1:30]
> hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(exprs(vsd)[select,], col = hmcol, trace="none", margin=c(10, 6))
> heatmap.2(counts(cdsBlind)[select,], col = hmcol, trace="none", margin=c(10,6))
> dists = dist( t( exprs(vsd) ) )
> mat = as.matrix( dists )
> rownames(mat) = colnames(mat) = with(pData(cdsBlind), paste(condition))
> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
> print(plotPCA(vsd, intgroup=c("condition")))
>
>>Are you saying that if you use the stranded data but do the counting
>>as if it were unstranded, it looks "just fine," but counting the
>>stranded data as stranded data makes it look quite bad?
>
>>Interesting ....I've actually never used stranded RNAseq data before,
>>but that's a bit surprising ...
>
> Exactly. it looks way better than counting the stranded data as stranded data.
>
>>What if you only consider a subset of genes that are in regions that
>>do not have annotated genes on the opposite strands -- the data for
>>these genes from the stranded and unstranded should be quite similar,
>>no? How do these look?
>
> I'll have a look, good idea!
>
>>But why? If you're just doing "vanilla DGE" and you have data that you
>>feel like you can use (the unstranded data) in triplicate, and a
>>source of singlicate data (the stranded one), why do you want to
>>torture some DGE analysis by cramming the stranded data through it
>>assuming <something> from the unstranded.
>
>>If you really want to use the stranded data, I'd encode the "library
>>type" as a covariate in the linear model while doing differential
>>expression. Both the DESeq and DESeq2 vignettes have a section on
>>multi-factorial designs where they incorporate data from single and
>>paired-end runs to perform a DGE analysis -- the way you would combine
>>unstranded and stranded counts for a gene would be rather similar.
>
> I followed the multifactorial design section in both DESeq and DESeq2 vignette and then checked the quality by sample clustering and visualization.
>
> In DESeq:
>
> CountTable=read.table("DGE.txt",header=TRUE,row.names=1)
> head(CountTable)
> Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT","ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded"))
> cdsFull=newCountDataSet(CountTable,Design)
> head(cdsFull)
> cdsFull=estimateSizeFactors(cdsFull)
> sizeFactors(cdsFull)
> cdsFull=estimateDispersions(cdsFull)
> plotDispEsts(cdsFull)
> fit1=fitNbinomGLMs(cdsFull,count ~ libType + condition)
> str(fit1)
> fit0=fitNbinomGLMs(cdsFull,count ~ libType )
> pvalsGLM = nbinomGLMTest( fit1, fit0 )
> padjGLM = p.adjust( pvalsGLM, method="BH" )
> head(fit1)
> cdsFullB=newCountDataSet(CountTable,Design$condition)
> cdsFullB=estimateSizeFactors(cdsFullB)
> cdsFullB=estimateDispersions(cdsFullB)
> rs = rowSums ( counts ( cdsFull ))
> theta = 0.4
> use = (rs > quantile(rs, probs=theta))
> table(use)
> cdsFilt = cdsFull[ use, ]
> fitFilt1 = fitNbinomGLMs( cdsFilt, count ~ libType + condition )
> fitFilt0 = fitNbinomGLMs( cdsFilt, count ~ libType )
> pvalsFilt = nbinomGLMTest( fitFilt1, fitFilt0 )
> padjFilt = p.adjust(pvalsFilt, method="BH" )
> padjFiltForComparison = rep(+Inf, length(padjGLM))
> padjFiltForComparison[use] = padjFilt
> tab3 = table( `no filtering` = padjGLM < .05,`with filtering` = padjFiltForComparison < .05 )
> addmargins(tab3)
> res=cbind(fit1,pvalsGLM=pvalsGLM,padjGLM=padjGLM)
> resSig=res[which(res$padjGLM<0.05),]
> table(res$padjGLM<0.05)
> write.csv(res,file='res_Multifactor_DESeq.csv')
> hist(res$pvalsGLM,breaks=100)
> resFilt=cbind(fitFilt1,pvalsFilt=pvalsFilt,padjFilt=padjFilt)
> resSigFilt=resFilt[which(resFilt$padjFilt<0.05),]
> table(resFilt$padjFilt<0.05)
> write.csv(resFilt,file='resFilt_Multifactor_DESeq.csv')
> plot(rank(rs)/length(rs), -log10(pvalsGLM), pch=16, cex=0.45)
> h1 = hist(pvalsGLM[!use], breaks=50, plot=FALSE)
> h2 = hist(pvalsGLM[use], breaks=50, plot=FALSE)
> colori = c(`do not pass`="khaki", `pass`="powderblue")
> barplot(height = rbind(h1$counts, h2$counts), beside = FALSE, col = colori,space = 0, main = "", ylab="frequency")
> text(x = c(0, length(h1$counts)), y = 0, label = paste(c(0,1)), adj = c(0.5,1.7), xpd=NA)
> legend("topright", fill=rev(colori), legend=rev(names(colori)))
> cdsFullBlind = estimateDispersions( cdsFull, method = "blind" )
> vsdFull = varianceStabilizingTransformation( cdsFullBlind )
> select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:100]
> hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))
> heatmap.2(counts(cdsFull)[select,], col = hmcol, trace="none", margin=c(10,6))
> dists = dist( t( exprs(vsdFull) ) )
> mat = as.matrix( dists )
> rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : "))
> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
> print(plotPCA(vsdFull, intgroup=c("condition", "libType")))
>
> In DESeq2:
>
> CountTable=read.table("DGE.txt",header=TRUE,row.names=1)
> head(CountTable)
> Design=data.frame(row.names=colnames(CountTable),condition=c("ADULT","ADULT","ADULT","ADULT","JUV","JUV","JUV","JUV","COMP","COMP","COMP","COMP","PRECOMP","PRECOMP","PRECOMP","PRECOMP"),libType=c("stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded","stranded","unstranded","unstranded","unstranded"))
> dds<-DESeqDataSetFromMatrix(countData=CountTable,colData=Design,design=~condition + libType)
> colData(dds)$condition<-factor(colData(dds)$condition,levels=c("PRECOMP","COMP","JUV","ADULT"))
> design(dds)<-formula(~libType + condition )
> dds<-DESeq(dds)
> rld <- rlogTransformation(dds, blind=TRUE)
> vsd <- varianceStabilizingTransformation(dds, blind=TRUE)
> vstMat <- assay(vsd)
> rlogMat <- assay(rld)
> px <- counts(dds)[,1] / sizeFactors(dds)[1]
> ord <- order(px)
> ord <- ord[px[ord] < 150]
> ord <- ord[seq(1, length(ord), length=50)]
> last <- ord[length(ord)]
> vstcol <- c("blue", "black")
> matplot(px[ord], cbind(assay(vsd)[, 1], log2(px))[ord, ], type="l", lty=1, col=vstcol, xlab="n", ylab="f(n)")
> legend("bottomright", legend = c(expression("variance stabilizing transformation"), expression(log[2](n/s[1]))),fill=vstcol)
> library("vsn")
> par(mfrow=c(1,3))
> notAllZero <- (rowSums(counts(dds))>0)
> meanSdPlot(log2(counts(dds,normalized=TRUE)[notAllZero,] + 1), ylim = c(0,2.5))
> meanSdPlot(assay(rld[notAllZero,]), ylim = c(0,2.5))
> meanSdPlot(assay(vsd[notAllZero,]), ylim = c(0,2.5))
> select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:100]
> hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6))
> heatmap.2(assay(rld)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6))
> heatmap.2(assay(vsd)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10, 6))
> distsRL <- dist(t(assay(rld)))
> select <- order(rowMeans(counts(dds,normalized=TRUE)),decreasing=TRUE)[1:30]
> hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(counts(dds,normalized=TRUE)[select,], col = hmcol, Rowv = FALSE, Colv = FALSE, scale="none", dendrogram="none", trace="none", margin=c(10,6))
> distsRL <- dist(t(assay(rld)))
> mat <- as.matrix(distsRL)
> rownames(mat) <- colnames(mat) <- with(colData(dds),
> paste(condition))
> rownames(mat) <- colnames(mat) <- with(colData(dds),paste(condition, libType, sep=" : "))
> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
> print(plotPCA(rld, intgroup=c("condition", "libType")))
>
> See attached the plots I obtained (*_MULTIFACTORIAL.jpeg)
>
> Let me know what you think
>
>>To be honest, if I had a set of data that had three replicates per
>>condition, and one set of data that had "singlicate" samples in a
>>condition which came from an apparently wildly different protocol, I'd
>>probably just use the triplicate data and forget about the outlier
>>data if I'm just doing "vanilla DGE".
>
>>If the stranded data is giving you wildly different results, the
>>simplest explanation might be that something may have gotten screwed
>>up in the library prep -- but you have no way to tell since you have
>>no replicates of those data.
>
>>As I said, you can include it in your linear model by using a
>>multi-factorial design if you *really* want to use it, but why bother
>>for a vanilla DGE analysis?
>
>>On the other hand, if you're using the stranded data to do something
>>more than just vanilla DGE -- perhaps you're interested in quantifying
>>anti-sense transcription, then I'd try to do a lot more exploratory
>>analysis to see how I could explain large difference in quantitation
>>that I'm seeing. I'd still be really pissed that I had no replicates,
>>but it wouldn't be the first time a graduate student is stuck with an
>>underpowered dataset that they are tasked to torture, and it sadly
>>won't be the last ... still, in this situation, I'd go the
>>multi-factorial design approach to combine both datasets for any
>>analysis I'm asked to do. In the meantime, I'd also complain to my
>>advisor/collaborator whatever to twist their arm enough to run
>>replicates for the future.
>
> You got the situation quite right.
> I agree with you, I wouldn't bother using the stranded data if I was doing a vanilla DGE but I am investigating long non-coding RNAs and so I need to have anti-sense transcription quantification. This is reason that brought me to include the untranded data in the analysis, to increase the number of replicates and test for DE to obtain a set of lncRNAs I can handle for downstream analysis.
>
> Let me know if you have any further idea
>
> Thanks a lot for help
>
> Federico
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Steve Lianoglou
Computational Biologist
Genentech
More information about the Bioconductor
mailing list