[BioC] totalTest number in ChipPeakAnno

Thu Nov 18 17:26:52 CET 2010

Dear Noah,

Many thanks for your detailed explanation of how totalTest is defined. What I am doing is similar to the second case. However, the TF we are interested in could bind anywhere in the genome. So with an mm9 genome size of 2.7E+9 bp and a peak width of <=200 bp, totalTest is 1.35E+7. It seems very computationally costly to run ChIPpeakAnno with a number this large. Nevertheless, do you think it is reasonable?
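
For reference, the arithmetic above and a rough hypergeometric overlap check can be sketched as follows. This is only an illustration under stated assumptions, not ChIPpeakAnno's own code: the function names (upper_limit, log_choose, overlap_p_value) are hypothetical, the sampling model is the one described in the quoted reply below, and the peak counts in the example call are made-up placeholders rather than data from my experiment.

```python
import math


def upper_limit(genome_size_bp, peak_width_bp):
    """Number of non-overlapping windows of peak_width_bp that fit in the genome."""
    return genome_size_bp // peak_width_bp


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def overlap_p_value(total_test, peaks_a, peaks_b, overlap):
    """Hypergeometric upper tail P(X >= overlap).

    Model (an assumption, not ChIPpeakAnno's code): peaks_b sites are drawn
    without replacement from total_test candidate sites, peaks_a of which
    belong to the first peak set; X counts the shared sites.
    """
    if total_test < max(peaks_a, peaks_b):
        raise ValueError("total_test must be at least as large as each peak set")
    # Restrict k to the support of the hypergeometric distribution.
    k_min = max(overlap, 0, peaks_a + peaks_b - total_test)
    k_max = min(peaks_a, peaks_b)
    if k_min > k_max:
        return 0.0
    log_denominator = log_choose(total_test, peaks_b)
    log_terms = [
        log_choose(peaks_a, k)
        + log_choose(total_test - peaks_a, peaks_b - k)
        - log_denominator
        for k in range(k_min, k_max + 1)
    ]
    # Sum in log space so that very small terms are added stably.
    largest = max(log_terms)
    total = sum(math.exp(t - largest) for t in log_terms)
    return min(1.0, math.exp(largest + math.log(total)))


if __name__ == "__main__":
    # mm9 genome size and peak width as given in this message.
    total_test = upper_limit(2_700_000_000, 200)
    print(total_test)  # 13500000
    # Peak counts below are made-up placeholders, not data from this thread.
    print(overlap_p_value(total_test, peaks_a=5000, peaks_b=4000, overlap=10))
```

In this sketch the loop runs over possible overlap sizes, so its cost grows with the peak counts rather than with totalTest itself. Whether the same holds inside makeVennDiagram is something I have not checked.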

Thanks,

Binbin

On 16 Nov 2010, at 18:41, Noah Dowell wrote:

> Hello Binbin,
> 
> It would be helpful to describe your problem and post it to the whole message board.  (There are many experts who can probably be more helpful than I can :-))  That said, I think you are referring to the "NaN" error, and below are my thoughts (Julie Zhu has also answered this a couple of times, and her replies are probably in the archives).
> 
> 
> When calling the makeVennDiagram function, you want to set the totalTest number to something larger than the experimentally determined peak number.  As far as I know, the totalTest number is used in the hypergeometric test that determines whether the overlap between two datasets is greater than would be expected by chance.  So one way to sort this out using biological information is to think about the maximum number of possible binding events and use that as the totalTest number.  For example, if you are studying a sequence-specific DNA-binding protein with a known motif, you could count the number of times that motif occurs in the genome and compare that to the number of peaks you have experimentally determined.
> 
> Motifs                          = 500
> Peaks                           = 200
> Peaks w/ motif                  = 180 (90%)
> "upper limit"                   = 500
> new "upper limit" for totalTest = 0.9 x 500 = 450
> 
> Now if you're working with a sequence-independent binding factor, it can get tricky.  One approach would be to determine the mean peak width, then divide the whole genome size by this number to get an upper limit.  This is probably way too high, so using additional information, such as whether the protein binds intergenic regions or ORFs, could bring the number down and make it more relevant to the biological experiment.  For example:
> 
> peaks             = 75
> intergenic peaks  = 70
> ORF peaks         = 5
> mean peak width   = 50 base pairs
> genome size       = 10000 base pairs
> "upper limit"     = 10000 / 50 = 200 (possible peaks)
> intergenic seq    = 4000 base pairs
> new "upper limit" = 4000 / 50 = 80 (possible intergenic peaks)
> 
> I was working with something more like the second case. I felt that a totalTest based on the total genome was quite relaxed and one based on the intergenic sequence only was quite stringent, so somewhere in the middle might be better. Most importantly, I feel I am standing on some solid biological reasoning for determining the amount of sampling.
> 
> Hope this helps, and I would be interested to hear if anybody has critiques of this approach or additional suggestions.
> 
> Best,
> 
> Noah
> 
> 
> 
> On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:
> 
>> Dear Noah,
>> 
>> I saw your post on the Bioconductor mailing list regarding the totalTest number for the p-value calculation in ChIPpeakAnno::makeVennDiagram(). I am having the same problem. Can I ask how you got it sorted?
>> 
>> 
>> Many thanks.
>> 
>> Binbin
>