[BioC] DEseq2 metagenomic analysis without replicates

Simon Anders anders at embl.de
Mon Jan 13 21:25:05 CET 2014


On 13/01/14 20:08, Kristina Fontanez [guest] wrote:
> I am working with a metagenomic dataset in DESeq2 that does not have
> true biological replicates but has multiple factors and levels. Due
> to the difficulty in obtaining these samples, replicates were not an
> option. The study design consists of seawater microbes collected at 4
> depths. 

Out of genuine curiosity: Why are replicates not an option? Maybe I have
a wrong idea about metagenomics, but I now imagine that you went
somewhere with a ship and lowered a bucket (or, maybe, a more
sophisticated sample-taking apparatus ;-) ) four times, each time to a
different depth, to take a sample. Why would it have been prohibitively
expensive to let down the bucket twice for each depth? Even if you do
all sample takings at the very same spot, you should get a reasonable
estimate for variation within a depth layer, because your boat probably
drifts a few hundred metres anyway in the time.

> For each depth, there are 3 treatments (live, dead and none).

Can you explain more what treatments these are?

> I expect there to be true biological differences in abundance both
> among depths and among treatments. So, combining the samples across
> depths is not a great option because I expect there to be some
> variation in abundance of taxa among depths. The same holds true for
> combining samples across treatments. However, if I have to choose one
> I would expect there to be less variation among depths than among
> treatments.
> The goal is to assess the differences in abundance among treatments
> and among depths. Ultimately, I would like to create relative
> abundance heat maps and bar charts displaying the taxa which are
> differentially abundant among treatments and among depths. I would
> also like to cluster the samples using PCoA or nMDS to see the
> effects of treatment and depth on sample similarity.
> Is it possible to normalize these data (using variance-stabilized
> normalization or regularized log transformation) without choosing a
> study design? In my perusing of the mailing list, one suggestion was
> to use a design ~ 1 to estimate the dispersions so that the
> dispersions treat all samples as replicates. In my case, that is not
> an ideal choice due to the expected variation in abundance of taxa
> among depths and treatments.

The normalization does not use a study design, nor does the VST or rlog
transform. (Note that the term "normalization" here only refers to
accounting for sequencing depth, and this has little to do with study
design.) The respective DESeq2 functions all ignore the 'design' argument.

> 1) Make relative abundance heat maps and bar charts based on simple
> counts/library size proportions. 

You should base them on the rlog-transformed data, not on normalized counts.

> Later, to do differential abundance
> tests, pool samples across depths to compare the 3 treatments (4
> replicates per treatment), calculating the dispersion using
> ~Treatment as the study design.

I would be surprised if you find much this way. I don't know what the
treatments are but "live" and "dead" sounds like rather dramatic
differences, giving rise to very high dispersion estimates. If you are
lucky, you get some results nevertheless.

In the end, however, the best option for non-replicated data is to
_guess_ a good value for the dispersion instead of estimating it. For
example, instead of the usual "dds <- estimateDispersions(dds)", you
write "dispersions(dds) <- 0.1" to set the dispersion value for all
genes to your best guess, in this example 0.1.

Where to get the guess from? Well, hopefully you have some idea how much
these counts typically vary between two samples taken from the depth and
treated the same manner. If you think, they typically vary by 50%, then
set the dispersion to 0.5^2=0.25. (If you don't have any idea at all,
then you really should have tried to find out experimentally.)

Of course, this approach may only be used to analyse preliminary data.
After all, if you mention in a paper that you used a mere guess, the
reviewers will (hopefully) ask how you arrived at it. As you don't have
replicates, you cannot argue from your data that your guess is a
reasonable value. It may also be hard to argue from previously published
data, because metagenomics is too new a field to have much to compare
with, and you cannot know whether your data is of as good quality as the
one you based your guess on.

It should be clear that you cannot claim that a difference is related to
depth or treatment if you don't know whether differences of this
magnitude could as well have been found between two samples taken at the
same depth and treated the same way.

I have explained this in a bit more detail, because we haven't had the
topic on this mailing list for a while, and it might make sense from
time to time to remind newer readers why replicates (or better: data
from several independent samples) are so important.

> 2) Normalize count data using a design ~1 option in DEseq2, export
> normalized counts. As far I can tell - there isn’t a documented
> procedure to do this in manual, vignettes or on the mailing list. The
> main problem here is that I can’t figure out a way to export the
> regularized log transformed counts.

Again, the normalization has nothing to with the study design, as it
_only_ normalizes for sequencing depth. You will get always the same
normalized counts, which you can access with
   counts( dds, normalized=TRUE )
no matter what design formula you specify.

> One approach to (2) for the regularized log transformation is below
> and the variable Deptreat_rlog contains the regularized log
> transformed count data which is the same size as the original dataset
> (but the “counts” matrix is inaccessible):

Are you following some code example here? Because it looks that you are
mixing up two. The result of 'rlogData' is a matrix, you don't need
'assay' to extract the matrix. Simply save the content of 'rlogData'
into a CSV file, or pass the variable as is it to the heatmap function,
or do something else with it. You need to use 'assay' if you have used
'rlogTransformation' instead of 'rlogData'.


More information about the Bioconductor mailing list