[BioC] totalTest number in ChIPpeakAnno

Fri Nov 19 15:41:01 CET 2010

Binbin,

In the current implementation of makeVennDiagram, the time needed to calculate
the p-value does not depend on totalTest.

Noah, thanks so much for sharing your insights!

Best regards,

Julie

On 11/18/10 11:26 AM, "Binbin Liu" <B.B.Liu at leeds.ac.uk> wrote:

> Dear Noah,
> 
> Many thanks for your detailed explanation of how totalTest is defined. What I
> am doing is similar to the second case. However, the TF we are interested in
> could bind anywhere in the genome. So with an mm9 genome size of 2.7E+9 bp and
> a peak width <= 200 bp, the totalTest is 2.7E+9 / 200 = 1.35E+7. It seems very
> computationally costly to run ChIPpeakAnno with this value. Nevertheless, do
> you think it is reasonable?
> 
> 
> Thanks,
> 
> Binbin
> 
> 
> On 16 Nov 2010, at 18:41, Noah Dowell wrote:
> 
>> Hello Binbin,
>> 
>> It would be helpful to describe your problem and post to the whole message
>> board.  (There are many experts who probably can be more helpful than myself
>> :-))  That said, I think you are referring to the "NaN" error and below are
>> my thoughts (Julie Zhu also answered this a couple of times and her reply is
>> probably in the archives).
>> 
>> 
>> When calling the makeVennDiagram function you want to set the totalTest
>> number to something that is larger than the experimentally determined peak
>> number.  As far as I know, the totalTest number is used for the
>> hypergeometric sampling that is used to determine if the overlap between two
>> datasets is more than would be expected by chance.  So one way to sort this
>> out using biological information is to think about the maximum number of
>> possible binding events and use that as the totalTest number.  For example,
>> if you are studying a sequence-specific DNA binding protein with a known
>> motif you could count the number of times that motif occurs in the genome
>> and compare that to the number of peaks you have experimentally determined.
>> 
>> Motifs = 500
>> Peaks = 200
>> Peaks w/ motif = 180 (90%)
>> "upper limit"    = 500
>> new "upper limit" for totalTest = .9 x 500 = 450
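The role of totalTest described above can be sketched as a one-sided hypergeometric test. This is a minimal illustration, not ChIPpeakAnno's implementation: the function names, the size of the second peak set (150) and the overlap (100) are hypothetical, and only the motif and peak counts come from the example above.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def overlap_pvalue(total_test, n1, n2, overlap):
    """P(X >= overlap) when n2 sites are drawn from total_test possible
    sites, n1 of which belong to the first peak set."""
    if total_test < max(n1, n2):
        raise ValueError("total_test must be at least as large as each peak set")
    # Overlaps smaller than n1 + n2 - total_test are impossible.
    lower = max(overlap, n1 + n2 - total_test, 0)
    upper = min(n1, n2)
    log_denom = log_choose(total_test, n2)
    p = 0.0
    for k in range(lower, upper + 1):
        p += exp(log_choose(n1, k)
                 + log_choose(total_test - n1, n2 - k)
                 - log_denom)
    return min(p, 1.0)


# Numbers from the motif example: 500 motifs, 180 of 200 peaks carry one.
motifs, peaks, peaks_with_motif = 500, 200, 180
total_test = round(motifs * peaks_with_motif / peaks)  # 0.9 x 500

# Hypothetical second data set with 150 peaks, 100 of them shared.
print(total_test, overlap_pvalue(total_test, peaks, 150, 100))
```

In this sketch the loop runs over candidate overlap sizes only, so its cost is bounded by the smaller peak set and not by total_test; a larger total_test lowers the expected chance overlap and therefore the p-value.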
>> 
>> Now if you're working with a sequence-independent binding factor it can get
>> tricky.  One approach would be to determine the mean peak width, then divide
>> the whole genome length by this number to get an upper limit.  This is
>> probably way too high, so using additional information, such as whether the
>> protein binds intergenic regions or ORFs, could bring the number down and
>> make it more relevant to the biological experiment.  For example:
>> 
>> peaks             = 75
>> intergenic peaks  = 70
>> ORF peaks         = 5
>> mean peak width   = 50 base pairs
>> genome size       = 10000 base pairs
>> "upper limit"     = 10000 / 50 = 200 (possible peaks)
>> intergenic seq    = 4000 base pairs
>> new "upper limit" = 4000 / 50 = 80 (possible intergenic peaks)
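The arithmetic in this second example can be written as a small helper. This is only a sketch of the reasoning above; the function name `max_possible_peaks` is hypothetical, and integer division is an assumption about how a fractional site should be rounded. The last call reuses the mm9 figures quoted elsewhere in this thread.

```python
def max_possible_peaks(sequence_length_bp, mean_peak_width_bp):
    """Upper limit on non-overlapping peaks that fit in a sequence."""
    if mean_peak_width_bp <= 0:
        raise ValueError("mean_peak_width_bp must be positive")
    return sequence_length_bp // mean_peak_width_bp


# Toy numbers from the example above.
genome_wide = max_possible_peaks(10_000, 50)      # relaxed upper limit
intergenic_only = max_possible_peaks(4_000, 50)   # stringent upper limit

# mm9 genome size and 200 bp peak width from the thread.
mm9_limit = max_possible_peaks(2_700_000_000, 200)

print(genome_wide, intergenic_only, mm9_limit)
```

Either limit, or a value between them, could then be passed as totalTest, depending on where the protein is believed to bind.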
>> 
>> I was working with something more like the second case.  I felt the
>> totalTest based on the total genome was quite relaxed, and the one based on
>> the intergenic sequence only was quite stringent, so somewhere in the middle
>> might be better.  Most importantly, I feel I am standing on some solid
>> biological reasoning for determining the amount of sampling.
>> 
>> Hope this helps, and I would be interested to hear if anybody has some
>> critiques of this approach or additional suggestions.
>> 
>> Best,
>> 
>> Noah
>> 
>> 
>> 
>> On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:
>> 
>>> Dear Noah,
>>> 
>>> I saw your post on the Bioconductor mailing list regarding the totalTest
>>> number for the p-value calculation in ChIPpeakAnno::makeVennDiagram(). I am
>>> having the same problem. Can I ask how you got it sorted?
>>> 
>>> 
>>> Many thanks.
>>> 
>>> Binbin
>> 
> 