[BioC] RNAseq machine learning classifier

Thu Jul 18 14:23:28 CEST 2013

Hi

>> My understanding is that the VST is applied on the count normalized
>> data, i.e., size factors must be estimated from the data.  I
>> believe that they are also data-dependent.  So I'm wondering when
>> one use the frozen parameters for the future data, how are the size
>> factors being computed?  Does DESeq2 use the loggeomeans of the
>> future data to estimate the size factor, or does it have them in
>> the frozen parameters for reuse? Or does the choice matter to VST?

Size factors are estimated as follows:

First, we construct a "virtual reference" sample: for each gene, the
geometric mean of its counts across all samples:

   geomeans <- exp( rowMeans( log( counts ) ) )

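Then, for each sample j, the size factor is the median, taken over all
genes, of the ratios of this sample's counts to the reference values:

   sf[j] <- median( counts[,j] / geomeans )

(Genes with a zero count in any sample have a reference value of zero
and are left out of the median.)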
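
As a minimal sketch of these two steps on a plain matrix (genes in rows,
samples in columns; the counts are made up for illustration and chosen so
that each sample has twice the depth of the previous one):

```r
# Toy count matrix: 4 genes (rows) x 3 samples (columns); values are invented.
counts <- matrix( c(  10,  20,  40,
                     100, 200, 400,
                       5,  10,  20,
                      50, 100, 200 ),
                  nrow = 4, byrow = TRUE,
                  dimnames = list( paste0("gene", 1:4), paste0("s", 1:3) ) )

# Virtual reference: per-gene geometric mean across samples
geomeans <- exp( rowMeans( log( counts ) ) )

# Size factor per sample: median over genes of count / reference
sf <- apply( counts, 2, function(cnts) median( cnts / geomeans ) )
```

Here the reference values are 20, 200, 10 and 100, so the size factors
come out as 0.5, 1 and 2 for s1, s2 and s3.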

If you want to apply frozen parameters from an old data set to a new
data set, then, to be on the safe side, you might also want to use the
'geomeans' vector of the old data set to calculate the size factors for
the new data set.

The following code (untested) should do the trick, provided ddsOld and
ddsNew contain the same genes in the same row order:

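   # per-gene mean of the log counts in the old data (stays on the log scale)
   loggeomeansOld <- rowMeans( log( counts(ddsOld) ) )
   # per new sample: median log ratio to the old reference, then exponentiate
   sizeFactors( ddsNew ) <-
      apply( counts(ddsNew), 2, function(cnts)
         exp( median( (log(cnts) - loggeomeansOld)[
            is.finite(loggeomeansOld) & (cnts>0) ] ) ) )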

Even though the difference might not matter in practice, this might in
fact be cleaner than recalculating the size factors from the new data in
the usual way.
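
The same idea can be tried on plain matrices without DESeq2. This is only
a sketch: 'sizeFactorsFrozen' is a hypothetical helper name, not a DESeq2
function, and the counts are invented. It assumes both matrices have the
same genes in the same row order.

```r
# Hypothetical helper: size factors for the columns of 'countsNew', computed
# against the per-gene log geometric means of 'countsOld'.
sizeFactorsFrozen <- function( countsNew, countsOld ) {
   loggeomeansOld <- rowMeans( log( countsOld ) )
   apply( countsNew, 2, function(cnts)
      exp( median( (log(cnts) - loggeomeansOld)[
         is.finite(loggeomeansOld) & (cnts>0) ] ) ) )
}

# Old data: 4 genes x 2 samples; the third gene has a zero count, so its
# log geometric mean is -Inf and it is dropped from the median.
countsOld <- matrix( c(  10,  40,
                        100, 400,
                          0,   8,
                         50, 200 ),
                     nrow = 4, byrow = TRUE )

# New data: one sample with twice the counts of the old reference
countsNew <- matrix( c( 40, 400, 3, 200 ), ncol = 1 )

sfNew <- sizeFactorsFrozen( countsNew, countsOld )
```

The old reference values for the three usable genes are 20, 200 and 100,
so the new sample gets a size factor of 2, expressed on the same scale as
the old data set's size factors.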

   Simon