[BioC] Normalization by DEseq
Simon Anders
anders at embl.de
Wed Oct 20 11:20:57 CEST 2010
Hi
On 10/19/2010 03:10 PM, Wolfgang Huber wrote:
> Normalisation: Briefly, the normalisation works as follows: if k_ij is
> the count of the i-th gene (or in your case, I guess, taxon) in the j-th
> sample, then we compute f_i as the geometric mean of these values across
> samples. The normalised count is k_ij / f_i.
Wolfgang forgot to mention the median, so let me try again:
First consider two replicate samples, indexed with j=1 and j=2. As they
are replicates, we expect the counts for a given gene i to always have
the same ratio k_i1 / k_i2. Of course, it will not be exactly the same
ratio, but if you plot a histogram of these ratios, you will see that
there is a clear, narrow peak, and its median (or its mode) should be a
good estimate for the actual sequencing depth ratio of the two samples.
Even if they are not replicates, the median should still be a reasonable
estimator as long as not more than half of the genes are differentially
expressed.
In order to deal with more than two samples, we define a fictive
"reference sample" against which to compare everything. This reference
sample has, as value for gene i, the geometric mean f_i over all samples
j of the counts k_ij.
Then, to get a size factor for sample j, we divide the counts for the
genes, k_ij, by the respective reference values f_j, and use the median
of all those ratios k_ij/f_i as the size factor:
s_j = median_j k_ij / f_i
>> I am wondering if I want to do differential expression analysis for
>> these two groups, should I filter out the group specific transcripts
>> before
>> putting into DEseq? Will this affect the normalization step?
Normally not, unless the H. pylori transcripts (which, I understand, you
expect to turn up only in the patients, not in the healthy control) make
up more than around half of the transcripts.
A useful diagnostic to double-check the size factors is a histogram of
the ratios. You can do this as follows:
library(DESeq)
# get an example count data set -- or use your data:
cds <- makeExampleCountDataSet()
# estimate the size factors:
cds <- estimateSizeFactors( cds )
# calculate the gene-wise geometric means
geomeans <- exp( rowMeans( log( counts(cds) ) ) )
# choose a sample whose size factor estimate we want to check
j <- 1
# just for clarity: the size factor was estimated as described above,
# so the following two lines give the same result
print( sizeFactors(cds)[j] )
print( median( ( counts(cds)[,j]/geomeans )[ geomeans>0 ] ) )
# Plot a histogram of the ratios of which we have just taken
# the median:
hist( log2( counts(cds)[,j] / geomeans ), breaks=100 )
# This histogram should be unimodal, with a clear peak at the value
# of the size factor. Mark it in red:
abline( v=log2( sizeFactors(cds)[ j ] ), col="red" )
If the histogram looks fine, the normalization has succeeded. If the
size factor does not hit the median, you can either filter down to genes
which are present in control and patient, as you suggested, or you could
also try to replace the median with another mode estimator, say, the
shorth (defined in the genefilter package). (I had once a data set where
the shorth seemed to work better than the median, but in general, I'd
stick to the median, unless the histogram seems weird.)
Best regards
Simon
More information about the Bioconductor
mailing list