[BioC] voom: CPM, FPKM and base counts

Gordon K Smyth smyth at wehi.EDU.AU
Wed May 22 02:13:37 CEST 2013


Hi Ryan,

It is true that voom doesn't depend on the discrete nature of count data, 
but it still needs to know the relative magnitude of the counts for 
different observations, otherwise there is no way to estimate the 
mean-variance relationship.

In my opinion, any high-efficiency statistical method for RNA-seq data 
needs to accommodate the fact that larger counts are relatively more 
precise than smaller counts.  This is done by estimating the mean-variance 
relationship for the counts in some way.  See the voom preprint for some 
discussion:

   http://www.statsci.org/smyth/pubs/VoomPreprint.pdf
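
As a rough numerical illustration of why larger counts are relatively more
precise, the sketch below assumes a quadratic mean-variance model,
var = mu + dispersion * mu^2 (Poisson sampling noise plus biological
variation).  The model, the function name squared_cv and the dispersion
value of 0.01 are assumptions made for this example only; they are not the
trend that voom estimates from the data.

```python
import math


def squared_cv(mean_count, dispersion=0.0):
    """Squared coefficient of variation under var = mu + dispersion * mu^2.

    This quadratic model is an assumption made for illustration only.
    """
    if mean_count <= 0:
        raise ValueError("mean_count must be positive")
    variance = mean_count + dispersion * mean_count ** 2
    return variance / mean_count ** 2


# Relative precision improves as the expected count grows, then levels off
# at sqrt(dispersion) once biological variation dominates.
for mu in (5, 50, 500, 5000):
    cv = math.sqrt(squared_cv(mu, dispersion=0.01))
    print(f"mean count {mu:>5}: CV = {cv:.3f}")
```

Under this assumed model the relative error of a small count is dominated
by the 1/mu term, which is why a method that ignores count size treats
imprecise and precise observations as equally reliable.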

Voom doesn't need actual counts, but it still needs a quantity that 
preserves the ordering of the counts.  So you could indeed count a 
paired-end fragment as 1/2 if one end maps and the other doesn't, or split 
reads across exons, and input the fractional counts to voom.  (I'm not 
recommending this as routine practice, just saying it would be 
statistically feasible.)

voom() can work with CPM or FPKM, but it needs to compute these quantities 
internally.  It can't accept FPKM as the primary input because FPKM does 
not preserve the ordering of the counts.  In my opinion, no
general-purpose, high-efficiency statistical analysis of RNA-seq data is
possible using FPKM as primary input.  (Unless of course one also provides the 
library sizes and gene lengths from which the FPKM was computed, so that 
the software can map back to count size from the FPKM.)
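
A minimal sketch of the ordering problem, using the usual FPKM definition
(count * 1e9 / (gene length in bp * library size)).  The gene names,
counts, lengths and library size are made-up numbers chosen for
illustration, and the helper names fpkm and count_from_fpkm are
hypothetical.

```python
def fpkm(count, gene_length_bp, library_size):
    """Fragments per kilobase of gene per million mapped fragments."""
    return count * 1e9 / (gene_length_bp * library_size)


def count_from_fpkm(fpkm_value, gene_length_bp, library_size):
    """Invert FPKM; only possible when length and library size are known."""
    return fpkm_value * gene_length_bp * library_size / 1e9


library_size = 20_000_000  # hypothetical sequencing depth

# Hypothetical genes: the long gene has ten times as many fragments.
genes = {
    "short_gene": {"count": 20, "length": 500},
    "long_gene": {"count": 200, "length": 10_000},
}

for name, gene in genes.items():
    value = fpkm(gene["count"], gene["length"], library_size)
    recovered = count_from_fpkm(value, gene["length"], library_size)
    print(f"{name}: count={gene['count']}, FPKM={value:.2f}, "
          f"recovered count={recovered:.1f}")
```

In this example the short gene gets the larger FPKM even though it has the
smaller, less precise count, so the FPKM values alone say nothing about
count size.  Supplying the gene lengths and library size restores it.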

If the sequencing depth is the same for all libraries, then the CPM are 
sufficient for statistical modelling, because in that case the CPM map 
back to count size through the library size.  In that case one could 
simply compute log-CPM and input it into limma using eBayes with 
trend=TRUE, and all would be fine.  That would be very similar to voom.
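
The equal-depth argument can be sketched as follows.  The counts and
library sizes are made-up numbers, and the helper names cpm and
count_from_cpm are hypothetical.  With one shared library size, CPM is just
a fixed rescaling of the count, so it carries the same ordering.  With
unequal library sizes, the same CPM can arise from very different counts.

```python
import math


def cpm(count, library_size):
    """Counts per million for one observation."""
    return count / library_size * 1e6


def count_from_cpm(cpm_value, library_size):
    """Map a CPM value back to count size for a known library size."""
    return cpm_value * library_size / 1e6


# Equal depth: CPM ordering matches count ordering across all libraries.
counts = [10, 1000, 100]
equal_depth = 10_000_000
cpms = [cpm(c, equal_depth) for c in counts]
same_order = (sorted(range(len(counts)), key=counts.__getitem__)
              == sorted(range(len(cpms)), key=cpms.__getitem__))
print("ordering preserved at equal depth:", same_order)

# Unequal depth: a count of 10 and a count of 100 give the same CPM.
a = cpm(10, 1_000_000)
b = cpm(100, 10_000_000)
print("same CPM from different counts:", math.isclose(a, b))
```

This is only the bookkeeping behind the argument; the modelling itself
would still be done in limma on the log-CPM values.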

Best wishes
Gordon


On Tue, 21 May 2013, Ryan C. Thompson wrote:

> Gordon Smyth has noted previously on this list that limma's voom method 
> is happy to accept raw counts, CPM, FPKM, and base counts (read counts 
> times read length, allows splitting reads across exons). My 
> understanding is that voom doesn't depend on or exploit the discrete nature 
> of count data that is fed to it, and can handle any data for which it 
> can properly model the mean-variance relationship (heteroskedasticity). 
> I'm sure Gordon could elaborate on this if I've missed anything.
