[BioC] voom: CPM, FPKM and base counts
Gordon K Smyth
smyth at wehi.EDU.AU
Wed May 22 02:13:37 CEST 2013
Hi Ryan,
It is true that voom doesn't depend on the discrete nature of count data,
but it still needs to know the relative magnitude of the counts for
different observations, otherwise there is no way to estimate the
mean-variance relationship.
In my opinion, any high-efficiency statistical method for RNA-seq data
needs to accommodate the fact that larger counts are relatively more
precise than smaller counts. This is done by estimating the mean-variance
relationship for the counts in some way. See the voom preprint for some
discussion:
http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
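
As a rough illustration of why larger counts are relatively more precise, here is a minimal Python sketch. It assumes pure Poisson sampling noise, which is a simplification: real RNA-seq data also carry biological variation, and voom estimates the mean-variance trend from the data rather than assuming a formula. The function name poisson_cv is made up for this example.

```python
import math

def poisson_cv(mean_count):
    """Coefficient of variation (sd / mean) of a Poisson count.

    Assumption for illustration only: the count is Poisson, so its
    variance equals its mean and sd / mean = 1 / sqrt(mean).
    """
    return math.sqrt(mean_count) / mean_count

# Relative noise shrinks as the expected count grows.
for mu in (4, 100, 10000):
    print(f"mean count {mu:>6}: relative sd = {poisson_cv(mu):.3f}")
```

Under this assumption the relative standard deviation falls from 0.5 at a mean count of 4 to 0.01 at a mean count of 10000, which is why a method that ignores count size gives up efficiency.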
Voom doesn't need actual counts, but it still needs a quantity that
preserves the ordering of the counts. So you could indeed count a
paired-end fragment as 1/2 if one end maps and the other doesn't, or split
reads across exons, and input the fractional counts to voom. (I'm not
recommending this as routine practice, just saying it would be
statistically feasible.)
voom() can work with CPM or FPKM, but it needs to compute these quantities
internally. It can't accept FPKM as the primary input because FPKM does
not preserve the ordering of the counts. In my opinion, no general-purpose,
high-efficiency statistical analysis of RNA-seq data is possible
using FPKM as primary input. (Unless of course one also provides the
library sizes and gene lengths from which the FPKM was computed, so that
the software can map back to count size from the FPKM.)
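
The point about FPKM can be checked with a little arithmetic. The Python sketch below uses the usual FPKM definition (fragments per kilobase of gene length per million mapped fragments); the gene lengths, counts and library size are invented for illustration, and the helper names fpkm and count_from_fpkm are hypothetical.

```python
def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase of gene length per million mapped fragments."""
    return count * 1e9 / (gene_length_bp * library_size)

def count_from_fpkm(value, gene_length_bp, library_size):
    """Invert fpkm(); only possible when length and library size are known."""
    return value * gene_length_bp * library_size / 1e9

library_size = 20_000_000  # made-up sequencing depth

# Two made-up genes in the same library.
short_gene = {"count": 10, "length_bp": 500}
long_gene = {"count": 1000, "length_bp": 100_000}

short_fpkm = fpkm(short_gene["count"], short_gene["length_bp"], library_size)
long_fpkm = fpkm(long_gene["count"], long_gene["length_bp"], library_size)

# The gene with 100 times fewer reads has the larger FPKM,
# so FPKM alone does not tell you which observation is more precise.
print(short_fpkm, long_fpkm)

# With gene length and library size supplied, the count is recoverable.
print(count_from_fpkm(short_fpkm, short_gene["length_bp"], library_size))
print(count_from_fpkm(long_fpkm, long_gene["length_bp"], library_size))
```

Here the short gene (10 fragments) gets an FPKM of 1.0 while the long gene (1000 fragments) gets 0.5, so the FPKM ordering is the reverse of the count ordering.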
If the sequencing depth is the same for all libraries, then the CPM are
sufficient for statistical modelling, because in that case the CPM map
back to count size through the library size. In that case one could
simply compute log-CPM and input it into limma using eBayes with
trend=TRUE, and all would be fine. That would be very similar to voom.
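
A small Python sketch of the equal-depth argument, using made-up counts and library sizes (the helper same_ordering is hypothetical and only compares rankings; it is not part of limma):

```python
def cpm(count, library_size):
    """Counts per million for one observation."""
    return count * 1e6 / library_size

def same_ordering(counts, library_sizes):
    """True if ranking observations by CPM matches ranking them by count."""
    cpms = [cpm(c, n) for c, n in zip(counts, library_sizes)]
    idx = range(len(counts))
    by_count = sorted(idx, key=lambda i: counts[i])
    by_cpm = sorted(idx, key=lambda i: cpms[i])
    return by_count == by_cpm

counts = [50, 100, 400]  # one gene observed in three made-up libraries

# Equal depth: CPM is the count times a constant, so ordering is kept.
equal_depth = same_ordering(counts, [10_000_000] * 3)

# Unequal depth: a larger count can end up with a smaller CPM.
unequal_depth = same_ordering(counts, [10_000_000, 40_000_000, 20_000_000])

print(equal_depth, unequal_depth)
```

With equal library sizes the CPM values are 5, 10 and 40, in the same order as the counts. With the unequal sizes above they become 5, 2.5 and 20, so the second library's count of 100 looks smaller than the first library's count of 50.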
Best wishes
Gordon
On Tue, 21 May 2013, Ryan C. Thompson wrote:
> Gordon Smyth has noted previously on this list that limma's voom method
> is happy to accept raw counts, CPM, FPKM, and base counts (read counts
> times read length, allows splitting reads across exons). My
> understanding is that voom doesn't depend on or exploit the discrete nature
> of count data that is fed to it, and can handle any data for which it
> can properly model the mean-variance relationship (heteroskedasticity).
> I'm sure Gordon could elaborate on this if I've missed anything.