[BioC] DEseq2 metagenomic analysis without replicates

Fri Jan 17 11:36:18 CET 2014

Hi

On 16/01/14 19:33, Kristina M Fontanez wrote:
> Your discussion about how to narrow down the large number of
> differentially abundant taxa is quite interesting. If I am understanding you
> correctly, you are suggesting that a reasonable cutoff on adjusted p-values
> (0.1 or 10% false discovery rate) is simply too lenient for my data.

No, not at all. Cut-offs below 1% on adjusted p values are very rarely
sensible.

Your issue is that your _question_ is "too lenient": The test provides
an answer to the following question:

"Does the OTU's abundance in the sample depend on the treatment
(sampling method)?"

A cut-off on adjusted p values at 10% provides you with a list of OTUs
for which this question can be answered with a clear "yes", such that
the "yes" is erroneous for at most 10% of the OTUs.
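
To make the meaning of such a cut-off concrete, here is a minimal Python
sketch of the Benjamini-Hochberg adjustment followed by a 10% cut. The OTU
names and raw p values are made up for illustration, and the function is a
plain re-implementation of the textbook procedure, not DESeq2's own code.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(p_values)
    # Indices of the p values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p values for five OTUs (illustrative only).
raw = {"OTU_1": 0.001, "OTU_2": 0.008, "OTU_3": 0.039,
       "OTU_4": 0.041, "OTU_5": 0.6}
padj = dict(zip(raw, benjamini_hochberg(list(raw.values()))))

# OTUs for which the "yes" holds at a 10% false discovery rate.
hits = [otu for otu, p in padj.items() if p < 0.1]
```

Lowering the 0.1 in the last line shortens `hits` but leaves the question
being asked unchanged.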

> I agree and that is why I initially dropped to 1e-5; however, that number is
> really quite arbitrary.

With such a low FDR, you reduce the expected absolute number of false
positives in your list of several hundred OTUs from "less than very few"
to "probably none at all".

If nearly all your taxa look significant at 10% FDR, this simply means
that you can say with confidence that the abundance of most OTUs in the
sample depends on the sampling method.

Frankly, this does not sound at all surprising, and is hence not only a
plausible, but probably an entirely correct result. Only, it is not a
very useful or deep insight.

If you change to a lower FDR threshold, this does _not_ change the
wording of the question and is hence not helpful.

What you really want to know is not _which_ OTUs are affected by
treatment but maybe _how_strong_ this effect is for each OTU. This is
why I wrote that you should look at the fold changes and try to find
some biology in there.

So, instead of making a list of OTUs, based on a "yes/no" question, you
might consider a quantitative down-stream analysis, which includes _all_
OTUs, regardless of p value. For example, you could group your OTUs into
larger clades and ask whether the _average_ fold change in certain
clades is much larger than in the others.
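
A clade-level summary of this kind could be sketched in Python as follows.
The OTU names, clade labels and log2 fold changes are invented for
illustration, and a simple unweighted mean is only one possible choice of
summary.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (OTU, clade, log2 fold change) records; all values made up.
records = [
    ("OTU_1", "clade_A", 2.5),
    ("OTU_2", "clade_A", 3.5),
    ("OTU_3", "clade_B", 0.5),
    ("OTU_4", "clade_B", -0.5),
    ("OTU_5", "clade_B", 0.3),
]


def mean_lfc_by_clade(rows):
    """Average the log2 fold changes of all OTUs within each clade."""
    by_clade = defaultdict(list)
    for _otu, clade, lfc in rows:
        by_clade[clade].append(lfc)
    return {clade: mean(values) for clade, values in by_clade.items()}


clade_means = mean_lfc_by_clade(records)

# Rank clades by the size of their average shift, largest first.
ranked = sorted(clade_means, key=lambda c: abs(clade_means[c]), reverse=True)
```

Note that every OTU enters the average, whether or not it passed any p value
cut-off.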

> In my data, small standard errors (high precision) (< 0.3) are
> correlated with small log2fold changes (< 3) and low mean of
> normalized counts across samples (< 1000). If I choose to only keep
> those taxa with small standard errors then I will be throwing away
> some of the most differentially abundant taxa in my dataset (log2fold
> changes > 3, mean counts > 1000 and standard errors > 0.3).

No, you should not use the standard errors to choose _which_ OTUs to
take. They are only useful if you do a quantitative analysis and want to
know how precise your inference of fold changes actually is.
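
One way to read a standard error in that quantitative sense is as the
half-width of an approximate confidence interval. The sketch below assumes
the log2 fold change estimate is roughly normally distributed (a Wald-type
approximation); the numbers are hypothetical.

```python
def lfc_confidence_interval(lfc, se, z=1.96):
    """Approximate 95% Wald interval for a log2 fold change.

    Assumes the estimate is approximately normal with standard error `se`.
    """
    return (lfc - z * se, lfc + z * se)


# Hypothetical OTU: a large fold change with a fairly large standard error.
low, high = lfc_confidence_interval(4.0, 0.5)
```

Here the interval stays well away from zero, so an OTU like this one is
imprecisely measured but still clearly changed.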

> 2) Of those truly differentially abundant taxa, what log2fold changes
> are biologically meaningful?

This is one way, but possibly less insightful than a quantitative
analysis. However, if you follow this direction then you should use
DESeq2's new "thresholded" hypothesis testing that Mike mentioned.
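
In DESeq2 itself this is requested through the `lfcThreshold` and
`altHypothesis` arguments of `results()`. The idea behind such a test can be
sketched in Python as below. This is a simplified Wald-type illustration
under a normal approximation, not DESeq2's actual code, and the function
names and input numbers are hypothetical.

```python
import math


def normal_upper_tail(z):
    """P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def thresholded_p_value(lfc, se, threshold):
    """Two-sided p value for the null hypothesis |LFC| <= threshold.

    Measures how far the estimate lies beyond the threshold, in units
    of its standard error, instead of how far it lies from zero.
    """
    z = (abs(lfc) - threshold) / se
    return min(1.0, 2.0 * normal_upper_tail(z))


# Hypothetical OTUs, tested against a log2 fold change threshold of 1.
p_strong = thresholded_p_value(lfc=4.0, se=0.5, threshold=1.0)
p_border = thresholded_p_value(lfc=1.0, se=0.5, threshold=1.0)
p_weak = thresholded_p_value(lfc=0.5, se=0.5, threshold=1.0)
```

With such a test the threshold becomes part of the question, so only OTUs
whose change demonstrably exceeds it obtain small p values.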

> We have metagenomic samples of microbial communities taken at this
> location, at particular depths, going back several years (albeit
> sequenced with different genomic technologies). Previous research has
> shown that the microbial community in the seawater at any given depth
> doesn’t change very much from year to year. If I then make the
> assumption that any changes in the seawater community at a particular
> depth from one year to the next are the “baseline” level of variation
> that can be expected, I should be able to calculate the maximum log2fold
> change in abundance for any given taxa from one year to the next. I
> could then use that log2fold change as my threshold for biologically
> meaningful change when analyzing my treatments.

No, this rather sounds like an assessment of _statistical_, not
_biological_ significance. You propose to get replicate information, not
using samples from different depths (as I suggested last time) but from
different years. This is very useful and allows you to assess the
precision of your abundance measurements, i.e., how close they get to the
true "population" values, which you would get from averaging over many
samples from different times. This is exactly what is meant by
statistical significance.

Biological significance means that the effect is strong enough that it
influences the system in a manner that supports the specific story you
want to tell in your paper -- and hence depends on the hypothesis you
want to investigate.

  Simon