[BioC] DESeq normalisation strategy
Simon Anders
anders at embl.de
Wed May 29 11:46:10 CEST 2013
Hi Davide
On 29/05/13 10:58, Davide Cittaro wrote:
> I've been reading about DESeq's normalization strategy and, as far as I understand, it works on a per-sample basis: counts for each sample are normalized according to a factor calculated using the geometric mean of the counts.
> Three questions:
> - is this strategy robust when comparing samples with extremely different library sizes?
Sure, why shouldn't it be?
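To make the robustness point concrete: DESeq estimates one size factor per sample as the median, over features, of the ratio of that sample's count to the feature's geometric mean across samples (in R this is `estimateSizeFactors`). The following is a plain-Python sketch of that median-of-ratios idea, not DESeq's actual implementation; the function name is illustrative.

```python
import math

def size_factors(counts):
    """Median-of-ratios size factors, DESeq style.

    `counts` is a list of samples, each a list of per-feature counts.
    For each sample, the size factor is the median of the ratios of its
    counts to the per-feature geometric mean across all samples.
    Features whose geometric mean is zero are skipped.
    """
    n_features = len(counts[0])
    # Per-feature geometric mean across samples (zero if any count is zero).
    geo_means = []
    for j in range(n_features):
        vals = [sample[j] for sample in counts]
        if min(vals) > 0:
            geo_means.append(math.exp(sum(math.log(v) for v in vals) / len(vals)))
        else:
            geo_means.append(0.0)
    factors = []
    for sample in counts:
        ratios = sorted(sample[j] / geo_means[j]
                        for j in range(n_features) if geo_means[j] > 0)
        mid = len(ratios) // 2
        if len(ratios) % 2:
            factors.append(ratios[mid])
        else:
            factors.append((ratios[mid - 1] + ratios[mid]) / 2)
    return factors

# Two samples whose library sizes differ tenfold: the estimated size
# factors differ by the same factor of ten (0.316... vs 3.162...).
print(size_factors([[100, 200, 300], [1000, 2000, 3000]]))
```

Because the estimator takes a median over many features, a handful of very highly expressed genes cannot dominate it, which is why very different library sizes are not a problem per se.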
> - If I wanted to calculate cpm on normalized counts, should I rescale the library size according to the sizeFactor?
Actually, no. I assume that by "cpm" you mean "counts per million",
which is a terse phrase meaning "number of reads mapped to the feature
per one million aligned reads". As such, "cpm" is _defined_ to mean
the quantity that you get by dividing the counts for your feature by the
number of aligned reads and multiplying by one million.
The notion of "calculating cpm on normalized counts" is hence a
contradiction in terms.
The whole point of DESeq's library size normalization is, of course,
that simply dividing by the number of aligned reads is not a good
strategy to get numbers which can be compared across samples, and that
hence cpm, RPKM, FPKM or any of the other variations on the "per
million" scheme are not useful quantities for differential analyses.
> - counts are calculated on genomic intervals, would the same approach make sense if I use counts on single nucleotides?
In principle, yes. The problem is that once your features are very small,
very many of the counts may be zero, and the geometric mean of any set
of numbers containing at least one zero is zero. Hence, you can only use
features with sufficiently high counts to get a stable estimate, and you
may not have enough of these.
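The zero problem is easy to see directly: the geometric mean is the exponential of the mean of the logs, so a single zero count collapses it to zero and the feature contributes no usable ratio. A small sketch:

```python
import math

def geometric_mean(values):
    # exp of the mean of logs; zero as soon as any value is zero,
    # since log(0) is undefined (the product of the values is zero).
    if any(v == 0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))

print(geometric_mean([4, 9]))     # sqrt(4 * 9) = 6
print(geometric_mean([4, 9, 0]))  # a single zero forces the result to 0
```

With per-nucleotide counts, most features will contain at least one zero across samples, so few features survive to inform the size-factor median.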
Simon