[BioC] DESeq2 - regularised log transformation blind or not?

Wolfgang Huber whuber at embl.de
Mon Feb 24 16:50:23 CET 2014


Hi Mike S,

Mike L.’s explanation is consistent with the impression that the replicates seem slightly closer to each other (compared to the between-cell-type distances) in the blind=FALSE plot. That is, you might be picking up somewhat more noise in the blind=TRUE case. It might also be worth exploring the ‘ntop’ argument of plotPCA.
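
A minimal sketch of that comparison, in case it helps. It uses simulated counts from makeExampleDESeqDataSet rather than the real cell-type data, so the object names, the sample sizes and the betaSD value are assumptions chosen only for illustration; ntop sets how many of the most variable genes enter the PCA.

```r
library(DESeq2)

# Simulated stand-in for the real experiment (assumption):
# 1000 genes, 12 samples in two conditions, with large condition effects.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)

# rlog without and with knowledge of the design (~ condition for this object).
rld_blind  <- rlog(dds, blind = TRUE)
rld_design <- rlog(dds, blind = FALSE)

# Same PCA on both transformations, using the 500 most variable genes.
plotPCA(rld_blind,  intgroup = "condition", ntop = 500)
plotPCA(rld_design, intgroup = "condition", ntop = 500)

# Restricting the PCA to fewer genes to see how sensitive the picture is.
plotPCA(rld_design, intgroup = "condition", ntop = 100)
```

Comparing the within-group spread to the between-group distance across these plots is one way to judge how much the choice of blind and ntop matters for a given data set.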

	Best wishes
		Wolfgang


On 24 Feb 2014, at 17:22, Michael Love <michaelisaiahlove at gmail.com> wrote:

> hi Mike,
> 
> 
> On 24 Feb 2014, at 15:21, Mike Stubbington <mstubb at ebi.ac.uk> wrote:
> 
> > Hi,
> >
> > I have just been reading the updated vignette for DESeq2 in the Bioconductor devel branch (http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf) and was interested by the comments in section 2.1.1 about the appropriateness of setting the blind argument when performing regularised log transformation. Specifically, the comment that
> >
> > “...blind dispersion estimation is not the appropriate choice if one expects that many or the majority of genes (rows) will have large differences in counts which are explainable by the experimental design…”
> >
> > Given this, I would really appreciate some further advice about when one should set blind=FALSE.
> >
> > For example, I am performing gene clustering using RNA-seq data for six different cell types. I would certainly expect a lot of genes to vary between the samples. Is this a case when blind=FALSE might be appropriate?
> >
> 
> Yes, I think it would be appropriate to use blind=FALSE here. I added this note to the vignette after this discussion a month ago:
> 
> https://stat.ethz.ch/pipermail/bioconductor/2014-January/057293.html
> 
> By default, the VST and rlog transformations use blind=TRUE, so that if people are using these transformations for quality assessment, the experimental design has absolutely no influence on the transformations (i.e. it is an unsupervised method).
> 
> When blind=FALSE, the experimental design is only used by the VST and rlog transformations in calculating the gene-wise dispersion estimates, in order to fit a trend line through the dispersions over the mean. Only the trend line is then used by the transformations, not the gene-wise estimates. Therefore, for visualization, clustering, or machine learning applications I tend to recommend blind=FALSE. 
> 
> The downside of setting blind=TRUE is that large differences due to the experimental design (e.g., cell types in your case, or different water columns in the linked discussion above) will inflate the gene-wise dispersion estimates. When most of the genes contain such large differences across conditions, this will raise the trend line, and then the transformed values will be greatly shrunken toward each other for most genes, which is an undesirable loss of signal.
> 
> Mike
> 
> 
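
The dispersion inflation described in the quoted message can be inspected directly. The following is a rough sketch on simulated data, not the transformation's own code path: it assumes that a blind fit is equivalent to estimating dispersions under an intercept-only design (~ 1), and the simulated data set and its parameters are again purely illustrative.

```r
library(DESeq2)

# Simulated data with large condition effects (assumption, for illustration).
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12, betaSD = 2)
dds <- estimateSizeFactors(dds)

# Design-aware dispersions: the object's own design is ~ condition.
dds_design <- estimateDispersions(dds)

# "Blind" dispersions: same counts, but an intercept-only design, so the
# condition effect is absorbed into the gene-wise dispersion estimates.
dds_blind <- dds
design(dds_blind) <- ~ 1
dds_blind <- estimateDispersions(dds_blind)

# Compare the typical gene-wise estimate under each design.
med_design <- median(mcols(dds_design)$dispGeneEst, na.rm = TRUE)
med_blind  <- median(mcols(dds_blind)$dispGeneEst,  na.rm = TRUE)
c(design = med_design, blind = med_blind)

# The fitted trend lines can be compared visually as well.
plotDispEsts(dds_design)
plotDispEsts(dds_blind)
```

If most genes differ strongly between conditions, one would expect the intercept-only estimates and their trend line to sit higher, which is the situation where blind=FALSE is the safer choice.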


