[BioC] totalTest number in ChipPeakAnno
Zhu, Lihua (Julie)
Julie.Zhu at umassmed.edu
Fri Nov 19 15:41:01 CET 2010
In the current implementation of makeVennDiagram, the time needed to calculate
the p-value does not depend on totalTest, so a large totalTest should not make
the run more costly.
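As a rough illustration of why that is, here is a Python sketch. It assumes, as Noah describes below, that the overlap p-value is an upper-tail hypergeometric probability with totalTest as the population size. It is not the makeVennDiagram source, and the function names and peak counts are made up for illustration. The sum runs over possible overlap counts, so the work scales with the peak numbers rather than with totalTest.

```python
from math import exp, lgamma


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def overlap_pvalue(total_test, n1, n2, overlap):
    """P(X >= overlap) for X ~ Hypergeometric(total_test, n1, n2).

    total_test: assumed number of possible binding sites (population size)
    n1, n2:     peak counts in the two sets
    overlap:    observed number of shared peaks
    """
    if total_test < max(n1, n2):
        raise ValueError("total_test must be at least as large as each peak set")
    lowest = max(overlap, n1 + n2 - total_test, 0)  # smallest feasible count
    highest = min(n1, n2)                           # largest feasible count
    if lowest > highest:
        return 0.0
    log_denominator = log_choose(total_test, n2)
    log_terms = [
        log_choose(n1, k) + log_choose(total_test - n1, n2 - k) - log_denominator
        for k in range(lowest, highest + 1)
    ]
    # Log-sum-exp keeps the sum stable when individual terms are tiny.
    peak = max(log_terms)
    return min(1.0, exp(peak) * sum(exp(t - peak) for t in log_terms))


# Hypothetical counts: 75 and 40 peaks sharing 38, under two totalTest choices.
p_relaxed = overlap_pvalue(200, 75, 40, 38)
p_stringent = overlap_pvalue(80, 75, 40, 38)
print(p_relaxed, p_stringent)
```

With these made-up counts the smaller totalTest gives the larger p-value, which is the "relaxed" versus "stringent" behaviour Noah mentions for the genome-wide and intergenic limits.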
Noah, thanks so much for sharing your insights!
On 11/18/10 11:26 AM, "Binbin Liu" <B.B.Liu at leeds.ac.uk> wrote:
> Dear Noah,
> Many thanks for your detailed explanation of how totalTest is defined. What I
> am doing is similar to the second case. However, the TF we are interested in
> could bind anywhere on the genome. So with an mm9 genome size of 2.7E+9 bp and
> a peak width <= 200 bp, totalTest is 1.35E+7. It seems very computationally
> costly to run ChIPpeakAnno with this value. Nevertheless, do you think it is
> reasonable?
> On 16 Nov 2010, at 18:41, Noah Dowell wrote:
>> Hello Binbin,
>> It would be helpful to describe your problem and post to the whole message
>> board. (There are many experts who probably can be more helpful than myself
>> :-)) That said, I think you are referring to the "NaN" error and below are
>> my thoughts (Julie Zhu also answered this a couple of times and her reply is
>> probably in the archives).
>> When calling the makeVennDiagram function you want to set the totalTest
>> number to something that is larger than the experimentally determined peak
>> number. As far as I know, the totalTest number is used for the
>> hypergeometric sampling that is used to determine if the overlap between two
>> datasets is more than would be expected by chance. So one way to sort this
>> out using biological information is to think about the maximum number of
>> possible binding events and use that as the totalTest number. For example,
>> if you are studying a sequence-specific DNA-binding protein with a known
>> motif, you could count the number of times that motif occurs in the genome
>> and compare that to the number of peaks you have experimentally determined.
>> Motifs = 500
>> Peaks = 200
>> Peaks w/ motif = 180 (90%)
>> "upper limit" = 500
>> new "upper limit" for totalTest = .9 x 500 = 450
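A small sketch of the counting step described above, in Python for illustration only. The toy sequence, the motif, and both function names are hypothetical, and an exact string match stands in for whatever motif model a real genome scan would use. The second function just reproduces the 0.9 x 500 scaling from the example.

```python
import re

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_motif_sites(sequence, motif):
    """Count exact motif occurrences on both strands, overlaps included."""
    sequence = sequence.upper()
    motif = motif.upper()
    # A set, so a palindromic motif is searched (and counted) only once.
    patterns = {motif, reverse_complement(motif)}
    # The lookahead (?=...) matches at every start position, even overlapping ones.
    return sum(
        len(re.findall(f"(?={re.escape(p)})", sequence)) for p in patterns
    )


def scaled_upper_limit(motif_sites, peaks, peaks_with_motif):
    """Scale the motif count by the fraction of peaks that contain the motif."""
    return round(motif_sites * peaks_with_motif / peaks)


toy_sequence = "TTGACGTCAATGACG"  # made-up sequence, not real data
print(count_motif_sites(toy_sequence, "TGACG"))
print(scaled_upper_limit(500, 200, 180))
```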
>> Now if you're working with a sequence-independent binding factor it can get
>> tricky. One approach would be to determine the mean peak width, then divide
>> the whole genome length by this number to get an upper limit. This is
>> probably way too high, so using additional information, such as whether the
>> protein binds intergenic regions or ORFs, could bring the number down and
>> make it more relevant to the biological experiment. For example:
>> peaks = 75
>> intergenic peaks = 70
>> ORF peaks = 5
>> mean peak width = 50 base pairs
>> genome size = 10000 base pairs
>> "upper limit" = 10000/ 50 = 200 (possible peaks)
>> intergenic seq = 4000 base pairs
>> new "upper limit" = 4000/50 = 80 (possible intergenic peaks)
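The two limits in this example, and the mm9 figure quoted earlier in the thread, follow from the same division. A minimal Python sketch; the function name is made up here, and it assumes non-overlapping peaks of one fixed width:

```python
def max_possible_peaks(region_length_bp, mean_peak_width_bp):
    """Upper limit on peak count: how many fixed-width peaks fit in the region."""
    if mean_peak_width_bp <= 0:
        raise ValueError("mean_peak_width_bp must be positive")
    return region_length_bp // mean_peak_width_bp


whole_genome_limit = max_possible_peaks(10_000, 50)   # toy genome from the example
intergenic_limit = max_possible_peaks(4_000, 50)      # intergenic sequence only
mm9_limit = max_possible_peaks(2_700_000_000, 200)    # 2.7E+9 bp, 200 bp peaks

print(whole_genome_limit, intergenic_limit, mm9_limit)
```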
>> I was working with something more like the second case. I felt that a
>> totalTest based on the total genome was quite relaxed, while one based on the
>> intergenic sequence only was quite stringent, so somewhere in the middle
>> might be better. Most importantly, I feel I am standing on solid biological
>> reasoning for determining the amount of sampling.
>> Hope this helps, and I would be interested to hear if anybody has
>> critiques of this approach or additional suggestions.
>> On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:
>>> Dear Noah,
>>> I saw your post on the Bioconductor mailing list regarding the totalTest
>>> number for the p-value calculation in ChIPpeakAnno::makeVennDiagram(). I am
>>> having the same problem. Can I ask how you got it sorted?
>>> Many thanks.