[BioC] RNAseq machine learning classifier
Simon Anders
anders at embl.de
Thu Jul 18 14:23:28 CEST 2013
Hi
>> My understanding is that the VST is applied on the count normalized
>> data, i.e., size factors must be estimated from the data. I
>> believe that they are also data-dependent. So I'm wondering when
>> one use the frozen parameters for the future data, how are the size
>> factors being computed? Does DESeq2 use the loggeomeans of the
>> future data to estimate the size factor, or does it have them in
>> the frozen parameters for reuse? Or does the choice matter to VST?
Size factors are estimated as follows:
First, we construct a "virtual reference", which is simply, for each gene,
the geometric mean of its counts across all samples:
geomeans <- exp( rowMeans( log( counts ) ) )
Then, for each sample j, the size factor is the median of the ratios of
this sample's count values to the reference values:
sf[j] <- median( counts[,j] / geomeans )
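
As a minimal sketch of the two steps above, here is a toy example in base R. The matrix 'toyCounts' and its values are made up for illustration (genes in rows, samples in columns), and no zero counts are present, so the plain formulas can be used without any filtering:

```r
# Hypothetical toy data: 4 genes (rows) x 3 samples (columns).
# Sample s2 is sequenced twice as deeply as s1, and s3 half as deeply.
toyCounts <- cbind(
  s1 = c(10, 20, 30, 40),
  s2 = c(20, 40, 60, 80),
  s3 = c( 5, 10, 15, 20)
)

# Step 1: virtual reference = per-gene geometric mean across samples
geomeans <- exp(rowMeans(log(toyCounts)))

# Step 2: per-sample size factor = median ratio of counts to the reference
sf <- apply(toyCounts, 2, function(cnts) median(cnts / geomeans))

print(sf)
```

Because the three columns differ only by a constant factor, the size factors should simply recover those factors (1, 2 and 0.5, up to floating-point error).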
If you want to apply frozen parameters from an old data set to a new
data set, then, to be on the safe side, you might also want to use the
'geomeans' vector of the old data set to calculate the size factors for
the new data set.
The following code (untested) should do the trick:
# per-gene mean of log counts in the old data (kept on the log scale)
loggeomeansOld <- rowMeans( log( counts(ddsOld) ) )
sizeFactors( ddsNew ) <-
   apply( counts(ddsNew), 2, function(cnts)
      exp( median( (log(cnts) - loggeomeansOld)[
         is.finite(loggeomeansOld) & (cnts>0) ] ) ) )
Even though the difference might not matter in practice, this might in
fact be cleaner than recalculating the size factors in the usual way.
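
To illustrate the frozen-reference idea without needing DESeq2 objects, here is a sketch in base R that works on plain count matrices. The helper name 'frozenSizeFactors' and the toy matrices are hypothetical; the logic mirrors the log-scale calculation above and assumes both matrices have the same genes in the same row order:

```r
# Hypothetical helper: size factors for new samples, computed against the
# per-gene log geometric means of an OLD count matrix (frozen reference).
frozenSizeFactors <- function(countsOld, countsNew) {
  stopifnot(nrow(countsOld) == nrow(countsNew))
  # Frozen reference on the log scale; genes with a zero count in any
  # old sample get -Inf here and are skipped below.
  loggeomeansOld <- rowMeans(log(countsOld))
  apply(countsNew, 2, function(cnts) {
    keep <- is.finite(loggeomeansOld) & (cnts > 0)
    exp(median((log(cnts) - loggeomeansOld)[keep]))
  })
}

# Made-up toy data: gene 4 has a zero in the old data and is ignored.
countsOld <- cbind(a = c(10, 20, 30, 0),
                   b = c(20, 40, 60, 8),
                   c = c( 5, 10, 15, 2))
countsNew <- cbind(x = c(30, 60, 90, 7),
                   y = c(10, 20,  0, 3))

sfNew <- frozenSizeFactors(countsOld, countsNew)
print(sfNew)
```

With this approach the size factor of a new sample does not depend on which other new samples happen to be in the same batch, which is what makes it suitable for use with frozen parameters.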
Simon