[BioC] voom() vs. RPKM/FPKM or otherwise normalized counts, and GC correction, when fitting models to a small number of responses (per-feature counts)

Fri May 25 05:02:17 CEST 2012

Dear Tim,

I don't follow what you are trying to do scientifically, and this makes 
all the difference when deciding what are the appropriate tools to use.

If you are undertaking some sort of analysis that requires absolute gene 
(or feature) expression levels as responses, then you should not be using 
voom or limma or edgeR.  limma and edgeR do not estimate absolute 
expression.

If on the other hand, you want to detect differentially expressed genes 
(or features), which is what voom does, then there is no need to correct 
for gene length.  The comments of Section 2.3 of the edgeR User's Guide 
and especially 2.3.2 "Adjustments for gene length, GC content, mappability 
and so on" are also relevant for voom.  There is no need to correct for 
any characteristic of a gene that remains unchanged across samples.

A good case has been made that GC content can have differential influence 
across samples, but that doesn't apply to gene length.

voom does not work on RPKM or FPKM, or on the output from cufflinks. 
voom estimates a mean-variance relationship, and the variance is a 
function of count size, not of expression level.

Yes, you need limma to use the output from voom, because other softwares 
do not generally have the ability to use quantitative weights.  If you 
ignore the weights, then the output from voom is just logCPM, and you 
hardly need voom to compute that.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Thu, 24 May 2012, Tim Triche, Jr. wrote:

> Hi Dr. Smyth and Dr. Law,
>
> I have been reading the documentation for limma::voom() and trying to
> understand why there seems to be no correction for the size of the feature
> in the model:
>
> In an experiment, a count value is observed for each tag in each sample. A
> tag-wise mean-variance trend is computed using lowess. The tag-wise mean is
> the mean log2 count with an offset of 0.5, across samples for a given tag.
> The tag-wise variance is the quarter-root-variance of normalized log2
> counts per million values with an offset of 0.5, across samples for a given
> tag. Tags with zero counts across all samples are not included in the
> lowess fit. Optional normalization is performed using
> normalizeBetweenArrays. Using fitted values of log2 counts from a linear
> model fit by lmFit, variances from the mean-variance trend were
> interpolated for each observation. This was carried out by approxfun.
> Inverse variance weights can be used to correct for mean-variance trend in
> the count data.
>
>
> I don't see a reference to the feature size in all of this. (?)  Am I
> missing something?  Probably something major (like, say, the relationship
> of GC content or read length to variance)...
> Is the idea that features with similar sequence properties/size and
> abundance will have their mean-variance relationship modeled appropriately
> and weights generated empirically?
>
> For comparison, what I have been doing (in lieu of knowing any better) is
> as follows: align with Rsubread, run subjunc and splicegrapher, and count
> against exon/gene/feature models:
>
> alignedToRPKM <- function(readcounts) { # the output of featureCounts()
>  millionsMapped <- colSums(readcounts$counts)/1000000
>  if('ExonLength' %in% names(readcounts$annotation)) {
>    geneLengthsInKB <- readcounts$annotation$ExonLength/1000
>  } else {
>    geneLengthsInKB <- readcounts$annotation$GeneLength/1000 # works fine
> for ncRNA and splice graph edges
>  }
>
>  # example usage: readcounts$RPKM <- alignedToRPKM(readcounts)
>  return( sweep(readcounts$counts, 2, millionsMapped, '/') /
> geneLengthsInKB )
> }
>
> (When I did pretty much the same thing with Bowtie/TopHat/CuffLinks I got
> about the same results but slower, so I stuck with Rsubread.  And
> featureCounts() is really handy.)
>
> So, given the feature sizes in readcounts$annotation I can at least put
> things on something like a similar scale.  Most of my modeling currently is
> focused on penalized local regressions and thus a performant (but accurate)
> measure that can be used for linear modeling on a large scale is desirable.
> Is the output of voom() what I want?  Does one need to use limma/lmFit()
> to make use of voom()'s output?
>
> Last but not least, should I use something like Yuval Benjamini's GCcorrect
> package (http://www.stat.berkeley.edu/~yuvalb/YuvalWeb/Software.html)
> before/during/instead of voom()?
> And if the expression of a feature or several nearby features is often the
> response, does it matter a great deal what I use?
>
> Thanks for any input you might have time to provide.  I have to assume that
> the minds at WEHI periodically scheme together how best to go about these
> things...
>
>
> -- 
> *A model is a lie that helps you see the truth.*
> *
> *
> Howard Skipper<http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}