[BioC] DESeq2 pitfalls of rlog transformation

Kristina M Fontanez fontanez at mit.edu
Fri Jan 24 19:53:09 CET 2014


Mike-

Thank you for your suggestion to try the regularized log transformation with blind=FALSE. The variance in the blind=FALSE data appears more similar to the untransformed variance.

                sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample11        sample12
OTU1206 norm_rld_blindtrue       136,280         93,429  131,636         93,351  120,957         93,334  120,358         93,317  93,304  93,310  93,308  93,322
OTU1206 norm_rld_blindfalse      117,964         2,594   107,115         1,816   84,599  1,623   83,377  1,399   1,212   1,295   1,262   1,465
OTU1206 raw      468,459         1,190   168,280         330     237,248         94      232,483         195     249     1,647   331     501

I also plotted the standard deviation of each transformation (blind=true and blind=false) against the mean as described in the DESeq manual to explore the effects of the different transformations on the variance.

> jes
class: DESeqDataSet
dim: 2151 12
exptData(0):
assays(1): counts
rownames(2151): OTU1 OTU2 ... OTU2150 OTU2151
rowData metadata column names(0):
colnames(12): HD5_150L HD5_150D ... HD9_SW_300_22 HD9_SW_500_22
colData names(5): DEPTH TREATMENT Cmg.m2.day Nmg.m2.day Pmg.m2.day

> design(jes)
~TREATMENT
> jes<-estimateSizeFactors(jes)
> jes<-estimateDispersions(jes,fitType="local")
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
> rld_jes<-rlogTransformation(jes,blind=FALSE)
> rld_trueblind_jes<-rlogTransformation(jes,blind=TRUE)
> par(mfrow=c(1,3))
> notAllZero<-(rowSums(counts(jes))>0)
> meanSdPlot(log2(counts(jes,normalized=TRUE)[notAllZero,]+1),ylim=c(0,2.5),main="Per taxa Stdev shifted logarithm")
> meanSdPlot(assay(rld_jes[notAllZero,]),ylim=c(0,2.5),main="Per taxa SD blind false")
> meanSdPlot(assay(rld_trueblind_jes[notAllZero,]),ylim=c(0,2.5),main="Per taxa SD blind true”)

Thank you for this suggestion.

Kristina

[cid:FC0E5165-A604-4F0D-B469-046A01034EB0 at mit.edu]
------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez at mit.edu<mailto:fontanez at mit.edu>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139



On Jan 24, 2014, at 11:51 AM, Michael Love <michaelisaiahlove at gmail.com<mailto:michaelisaiahlove at gmail.com>> wrote:

hi Kristina,

In my previous email I suggested using the default, blind=TRUE, but I think I was mistaken given your particular kind of data.  What is the point of the difference with this argument?  If you set blind=TRUE, we are supposing that, for most rows, there are no differences across conditions and we can estimate a global mean-dispersion trend line using the majority of the rows.  This fitted line is then used to inform the rlog tranformation.  But in your case, the null is clearly inappropriate for the majority of rows. The fitted line is producing very high estimates of dispersion, essentially the algorithm sees all this variation as all noise. Given that the design of ~ 1 is inappropriate for the majority of rows your dataset, I would recommend specifying your true experimental design and setting blind=FALSE. We should find a way to better inform this choice in the documentation.

Mike


On Fri, Jan 24, 2014 at 11:12 AM, Kristina M Fontanez <fontanez at mit.edu<mailto:fontanez at mit.edu>> wrote:
Dear Bioconductors-

After previous discussions on this list, I had settled on using regularized log transformed counts in order to present relative abundance heatmaps/barcharts of my metagenomic data. As this transformation accounts for both differing library sizes and stabilizes the variance among counts, it seemed an appropriate choice.

However, now I am running into the issue that the transformed data does not make biological sense. For example, below is a small table with 12 samples and the OTU counts for one taxa which have been either rlog transformed (using the blind option to ignore experimental design) or represent the raw counts. This particular taxa is one of the most abundant in my dataset. As you can see, the rlog transformed counts are much more stable across samples than the raw counts. This is exactly the expected behavior of the rlog transformation in that the variance of counts among samples has been reduced and the library sizes are now much more similar.

The problem is that the results do not make biological sense. This particular taxa is a copiotroph and so grows to high abundance in the presence of short bursts of nutrients. That kind of burst of growth is expected in samples 1,3,5, and 7, however samples 9-12 represent seawater samples and so they are nutrient-poor (relatively). This taxa ought to be in low abundance in those samples, which is the case in the raw counts. However, the rlog transformation inflates their counts so that they are now similar to those found in all other samples. This result obscures what is likely the true abundance of this taxa in the water column by inflating it beyond what is biologically reasonable given the environment in which these samples were collected.

        sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10        sample11        sample12
OTU1 rlog trans  136,280         93,429          131,635         93,351          120,957         93,334          120,358         93,317          93,304          93,310          93,308          93,322
OTU1 raw         468,459         1,190   168,280         330     237,248         94      232,483         195     249     1,647   331     501

With expression data, it seems appropriate to assume that in a treatment vs control scenario, all genes are likely to be expressed at some level (if even a low level) in all samples. However, I do not think that holds true for the abundance of taxa, especially in a case like mine where the samples are not so much different treatments as different ways of collecting microbes. It very well may be that the abundance of a particular taxa drops to a few hundred cells in any given sample because it is simply outcompeted by other microbes - that is a distinctly plausible biological event. The variance stabilization appears to mute that effect thereby making it appear that all taxa appear in all samples at relatively similar abundances.

Given these results, is rlog transformation really an appropriate choice when comparing metagenomic data? The variance stabilization seems to drastically change the biological patterns in the dataset.

Thanks,
Kristina

------------------------------------------------------------------
Kristina Fontanez, Postdoctoral Fellow
fontanez at mit.edu<mailto:fontanez at mit.edu><mailto:fontanez at mit.edu<mailto:fontanez at mit.edu>>
Massachusetts Institute of Technology
Department of Civil and Environmental Engineering
48-120E
15 Vassar Street
Cambridge, MA 02139




        [[alternative HTML version deleted]]

_______________________________________________
Bioconductor mailing list
Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
https://stat.ethz.ch/mailman/listinfo/bioconductor
Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor




More information about the Bioconductor mailing list