[BioC] totalTest number in ChipPeakAnno
Zhu, Lihua (Julie)
Julie.Zhu at umassmed.edu
Fri Nov 19 15:41:01 CET 2010
Binbin,
In the current implementation of makeVennDiagram, the time used to calculate
the p-value does not depend on totalTest.
Noah, thanks so much for sharing your insights!
Best regards,
Julie
On 11/18/10 11:26 AM, "Binbin Liu" <B.B.Liu at leeds.ac.uk> wrote:
> Dear Noah,
>
> Many thanks for your detailed explanation of how totalTest is defined. What I
> am doing is similar to the second case. However, the TF we are interested in
> could bind anywhere on the genome. So with an mm9 genome size of 2.7E+9 bp and
> a peak width <= 200 bp, the totalTest is 1.35E+7. It seems very computationally
> costly to run ChIPpeakAnno. Nevertheless, do you think it is reasonable?
>
>
> Thanks,
>
> Binbin
>
>
> On 16 Nov 2010, at 18:41, Noah Dowell wrote:
>
>> Hello Binbin,
>>
>> It would be helpful to describe your problem and post to the whole message
>> board. (There are many experts who probably can be more helpful than myself
>> :-)) That said, I think you are referring to the "NaN" error and below are
>> my thoughts (Julie Zhu also answered this a couple of times and her reply is
>> probably in the archives).
>>
>>
>> When calling the makeVennDiagram function you want to set the totalTest
>> number to something that is larger than the experimentally determined peak
>> number. As far as I know, the totalTest number is used for the
>> hypergeometric sampling that is used to determine if the overlap between two
>> datasets is more than would be expected by chance. So one way to sort this
>> out using biological information is to think about the maximum number of
>> possible binding events and use that as the totalTest number. For example,
>> if you are studying a sequence-specific DNA binding protein with a known
>> motif you could count the number of times that motif occurs in the genome
>> and compare that to the number of peaks you have experimentally determined.
>>
>> Motifs = 500
>> Peaks = 200
>> Peaks w/ motif = 180 (90%)
>> "upper limit" = 500
>> new "upper limit" for totalTest = .9 x 500 = 450
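
The arithmetic in the motif example above can be written out as a small sketch. The function and parameter names are hypothetical and not part of ChIPpeakAnno; the numbers are the ones from the example.

```python
def motif_based_total_test(motif_count, peak_count, peaks_with_motif):
    """Scale the genome-wide motif count by the fraction of observed peaks
    that contain the motif (hypothetical helper, not a ChIPpeakAnno call)."""
    if peak_count <= 0:
        raise ValueError("peak_count must be positive")
    return round(motif_count * peaks_with_motif / peak_count)


# Numbers from the example: 500 motifs, 200 peaks, 180 peaks with a motif.
total_test = motif_based_total_test(motif_count=500,
                                    peak_count=200,
                                    peaks_with_motif=180)
print(total_test)
```

The result would then be passed as totalTest when calling makeVennDiagram.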
>>
>> Now if you're working with a sequence-independent binding factor it can get
>> tricky. One approach would be to determine the mean peak width, then divide
>> the whole genome length by this number to get an upper limit. This is
>> probably way too high, so using additional information, such as whether the
>> protein binds intergenic regions or ORFs, could bring the number down and make
>> it more relevant to the biological experiment. For example:
>>
>> peaks = 75
>> intergenic peaks = 70
>> ORF peaks = 5
>> mean peak width = 50 base pairs
>> genome size = 10000 base pairs
>> "upper limit" = 10000/50 = 200 (possible peaks)
>> intergenic seq = 4000 base pairs
>> new "upper limit" = 4000/50 = 80 (possible intergenic peaks)
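
The width-based estimate above is a single division, sketched here with a hypothetical helper name. It reproduces both toy limits from the example, and also the mm9 figure quoted earlier in the thread (2.7E+9 bp at 200 bp per peak).

```python
def width_based_total_test(sequence_length_bp, mean_peak_width_bp):
    """Number of non-overlapping peak-sized windows that fit in a sequence
    (hypothetical helper, not a ChIPpeakAnno call)."""
    if mean_peak_width_bp <= 0:
        raise ValueError("mean_peak_width_bp must be positive")
    return sequence_length_bp // mean_peak_width_bp


whole_genome_limit = width_based_total_test(10_000, 50)   # whole toy genome
intergenic_limit = width_based_total_test(4_000, 50)      # intergenic sequence only
mm9_limit = width_based_total_test(2_700_000_000, 200)    # mm9, 200 bp peaks

print(whole_genome_limit, intergenic_limit, mm9_limit)
```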
>>
>> I was working with something more like the second case. I felt that a
>> totalTest based on the total genome was quite relaxed and one based on the
>> intergenic sequence only was quite stringent, so somewhere in the middle might
>> be better. Most importantly, I feel I am standing on some solid biological
>> reasoning for determining the amount of sampling.
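
To see why a genome-wide totalTest is relaxed and an intergenic-only one is stringent, here is a minimal sketch. It assumes the p-value is an upper-tail hypergeometric probability on the overlap, which may differ from what makeVennDiagram actually computes. The function name is hypothetical, and the second peak set (40 peaks) and the overlap (38) are made-up numbers; only the 75 peaks and the two limits (200 and 80) come from the example above.

```python
from math import comb


def hypergeom_upper_tail(overlap, n1, n2, total_test):
    """P(X >= overlap) when n2 sites are drawn without replacement from
    total_test possible sites, n1 of which are peaks in the first set."""
    if max(n1, n2) > total_test:
        raise ValueError("total_test must be at least as large as each peak count")
    denom = comb(total_test, n2)
    # comb() returns 0 for impossible draws, so those terms vanish on their own.
    numer = sum(comb(n1, k) * comb(total_test - n1, n2 - k)
                for k in range(overlap, min(n1, n2) + 1))
    return numer / denom


# 75 peaks in set 1 (from the example); set 2 and the overlap are assumed.
p_genome = hypergeom_upper_tail(overlap=38, n1=75, n2=40, total_test=200)
p_intergenic = hypergeom_upper_tail(overlap=38, n1=75, n2=40, total_test=80)

print(p_genome, p_intergenic)
```

With the larger totalTest the same overlap looks far less likely to arise by chance, so the p-value is smaller. The sketch also rejects a totalTest below either peak count, in line with the advice to set totalTest larger than the number of peaks.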
>>
>> Hope this helps. I would be interested to hear if anybody has
>> critiques of this approach or additional suggestions.
>>
>> Best,
>>
>> Noah
>>
>>
>>
>> On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:
>>
>>> Dear Noah,
>>>
>>> I saw your post on the Bioconductor mailing list regarding the totalTest
>>> number for the p-value calculation in ChIPpeakAnno::makeVennDiagram(). I am
>>> having the same problem. Can I ask how you got it sorted?
>>>
>>>
>>> Many thanks.
>>>
>>> Binbin
>>
>