[BioC] DEseq for chip-seq data normalisation

Ryan rct at thompsonclan.org
Fri Nov 8 01:03:32 CET 2013


It is true that TMM and similar methods assume that most binding levels
are similar across samples, and that this assumption may not hold.
However, using raw library sizes (i.e. the sum of all counts in a
sample) imposes an assumption that is at least as bad: that the total
amount of binding in each sample is constant. In other words, using raw
library size is equivalent to assuming that a fixed amount of binding is
allocated differently across the genome in each sample. If I had to
choose one of those two assumptions, I would almost always choose the
former and go with TMM.
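
To make the two assumptions concrete, here is a toy sketch in plain Python.
It is not DiffBind or edgeR code: the function names, the peak counts and
the fold change are all made up for illustration, and the trimmed mean is a
simplified stand-in for TMM (edgeR's calcNormFactors also applies precision
weights and trims on abundance, which this sketch omits). It assumes two
samples sequenced to the same depth, where 90 of 100 peaks have unchanged
binding and 10 peaks gain binding 11-fold in sample B.

```python
import math


def library_size_factor(sample, reference):
    """Scale factor taken from raw library sizes (sum of all counts)."""
    return sum(sample) / sum(reference)


def trimmed_mean_factor(sample, reference, trim_percent=30):
    """Simplified TMM-style factor: trimmed mean of per-peak log2 ratios.

    Illustration only; unlike edgeR's TMM there are no precision weights
    and no trimming on abundance.
    """
    log_ratios = sorted(
        math.log2(s / r) for s, r in zip(sample, reference) if s > 0 and r > 0
    )
    k = len(log_ratios) * trim_percent // 100  # peaks dropped from each tail
    kept = log_ratios[k:len(log_ratios) - k]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical true binding: 90 unchanged peaks, 10 peaks with 11-fold more in B.
true_a = [100.0] * 100
true_b = [100.0] * 90 + [1100.0] * 10

# Same sequencing depth in both samples, so reads are shared out in
# proportion to binding and the raw library sizes come out equal.
depth = 10000.0
counts_a = [depth * x / sum(true_a) for x in true_a]
counts_b = [depth * x / sum(true_b) for x in true_b]

lib = library_size_factor(counts_b, counts_a)
tmm = trimmed_mean_factor(counts_b, counts_a)

# Apparent B/A ratio after normalization, for an unchanged and a changed peak.
unchanged_lib = counts_b[0] / lib / counts_a[0]
unchanged_tmm = counts_b[0] / tmm / counts_a[0]
changed_lib = counts_b[-1] / lib / counts_a[-1]
changed_tmm = counts_b[-1] / tmm / counts_a[-1]

print("library size: unchanged peak ratio", unchanged_lib, "changed", changed_lib)
print("trimmed mean: unchanged peak ratio", unchanged_tmm, "changed", changed_tmm)
```

In this setup the library-size factor is 1, so the unchanged peaks look
like a 2-fold loss of binding in B and the real 11-fold gain is reported
as 5.5-fold. The trimmed-mean factor leaves the unchanged peaks at a ratio
of 1. The sketch only covers the case where most peaks really are
unchanged; if binding shifts at most peaks, the trimmed mean would absorb
that shift, which is the failure mode Rory describes below.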


On 11/6/13, 11:16 AM, Rory Stark wrote:
> Hi Ying-
> We actually just changed the default normalization from effective to full
> library size in the most recent release. The reason is that while
> effective is frequently a better choice, it is based on the assumption
> that overall binding levels in all the samples are similar. When this
> assumption is incorrect, it can result in substantially incorrect results;
> using full library size when effective applies results in less
> catastrophically wrong answers.
> I will definitely be changing normalization to use effective sizes when it
> is the right thing to do, but I have become aware that many (most?)
> DiffBind users don't change the defaults, so we determined that a more
> conservative default was preferable.
> I'm not sure what you're asking regarding "try and minimize changes
> between conditions" in this context?
> Cheers-
> Rory
> On 06/11/2013 19:08, "Ying Wu" <daiyingw at gmail.com> wrote:
>> Hi Rory,
>> Could you give some insight into why TMM is used with full library size,
>> it seems to make sense for effective library size case but where full
>> library size is used, would it still be valid to try and minimize
>> changes between conditions?
>> Best,
>> -Ying
>> On 11/05/13 18:18, Rory Stark wrote:
>>> Hi Guiseppe-
>>> You can retrieve the complete matrix of read counts from DiffBind, either
>>> normalized or not, using dba.peakset with bRetrieve=TRUE. You can set the
>>> score to use via dba.count with peaks=NULL and score=DBA_SCORE_READS, or
>>> any of the other possible score values. The default score is
>>> DBA_SCORE_TMM_MINUS_FULL, which is normalized using edgeR's TMM method,
>>> after subtracting the reads in the control, and using the full library
>>> size (not just the reads in peaks) as a scalar.
>>> Cheers-
>>> Rory