[BioC] function similar to phyper function that can handle 3 or more gene lists

Karl Brand k.brand at erasmusmc.nl
Mon Mar 22 11:16:11 CET 2010


Dear List,

This is a repost of-

Re: [BioC] package or code to quantify the significance of the venn 
overlap between 2 or 3 lists of genes

-with a related, but new question born of my success using phyper.

I employed phyper to estimate the likelihood that the number of genes 
overlapping between 2 different lists of genes is due to chance.
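
For reference, the two-list calculation can be sketched as below. The counts are made-up placeholders rather than my real numbers, and the null model assumed is that list B is a random draw from all probes on the chip. The sketch also shows why 1 is subtracted from the overlap: with lower.tail = FALSE, phyper returns P(X > q), so q = overlap - 1 gives P(X >= overlap).

```r
# Hypothetical counts, for illustration only (not the data from this thread)
N   <- 45000  # probes on the chip (the "universe")
nA  <- 500    # probes in list A
nB  <- 400    # probes in list B
nAB <- 20     # probes observed in both lists

# P(overlap >= nAB) if list B were a random draw of nB probes from N.
# With lower.tail = FALSE, phyper(q, ...) is P(X > q), hence the "- 1".
p_upper <- phyper(nAB - 1, nA, N - nA, nB, lower.tail = FALSE)

# Overlap expected by chance under the same null model
expected <- nA * nB / N

# Cross-check: a one-sided Fisher's exact test on the 2x2 table
# (in A / not in A) x (in B / not in B) gives the same p-value
tab <- matrix(c(nAB, nA - nAB, nB - nAB, N - nA - nB + nAB), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value
```

Both p_upper and p_fisher rest on the assumption that the two lists are independent under the null, which may not hold for lists derived from the same animals.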

I need to do the same with 3 lists of genes which phyper doesn't appear 
capable of. Can anyone recommend a function or share a script which 
might achieve this? Previous post/discussion below if it helps.
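
One possible approach, offered only as a sketch under stated assumptions: if the three lists are treated as independent random draws of fixed sizes from the same N probes, the size of the A-and-B overlap is hypergeometric, and given that size the number of those probes also falling in C is hypergeometric again. Summing over the possible A-and-B overlaps gives the tail probability for the three-way overlap. The function name p_overlap3 and its arguments are hypothetical, not from any package.

```r
# P(three-way overlap >= t) for three independent random lists of fixed
# sizes nA, nB, nC drawn from N probes. Illustrative helper, not a package
# function; it assumes the lists are independent under the null.
p_overlap3 <- function(t, nA, nB, nC, N) {
  k <- 0:min(nA, nB)                  # possible sizes of the A-and-B overlap
  w <- dhyper(k, nA, N - nA, nB)      # P(|A and B| = k)
  # P(at least t of those k probes are also in C), vectorised over k
  tail_k <- phyper(t - 1, k, N - k, nC, lower.tail = FALSE)
  sum(w * tail_k)
}

# Hypothetical usage: three lists of 500, 400 and 450 probes out of 45000,
# with 3 probes observed in all three lists
p3 <- p_overlap3(t = 3, nA = 500, nB = 400, nC = 450, N = 45000)
```

The same independence caveat applies as for two lists, so for lists derived from the same animals the result is at best a rough guide.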

With thanks in advance, cheers,

Karl


> -------- Original Message --------
> Subject: Re: [BioC] package or code to quantify the significance of the
> venn overlap between 2 or 3 lists of genes
> Date: Thu, 18 Mar 2010 17:25:14 +0100
> From: Karl Brand<k.brand at erasmusmc.nl>
> To: bioconductor at stat.math.ethz.ch<bioconductor at stat.math.ethz.ch>
> CC: Wolfgang Huber<whuber at embl.de>, MCM at stowers.org, seandavi at gmail.com
>
> Dear List,
>
> I tried the phyper function as follows:
>
> # each name below stands for a count of probes, not the list itself
> phyper(overlapAB - 1, genelistA, totalprobesonchip - genelistA,
>        genelistB, lower.tail = FALSE, log.p = FALSE)
>
> The output seemed logical to me, but I'd really appreciate someone's
> patience and experience to address some concerns:
>
> -is it 'safe' to employ this test where genelistA and genelistB were
> obtained from AnimalX-tissue1 and AnimalX-tissue2 respectively? i.e., do I
> violate any independence assumptions this test makes?
>
> -the output value is a 'distribution function'. Can I interpret this to
> be something like the 'likelihood that my observed result is due to
> chance alone'?
>
> -do I need to subtract 1 from my 'overlap'? In the example I followed
> at tinyurl.com/ygtmefa this appears to be the case, but the vignette
> has nothing on this.
>
> *most of all* how can I perform this test on three lists of overlapping
> genes, not merely the two in this case? Maybe someone knows a
> hack/method to combine the 3 outputs (of three pairwise comparisons) for
> an estimate of the 3-way overlap? Even a conservative estimate would be
> better than nothing!
>
> With thanks in advance for thoughts and suggestions, cheers,
>
> Karl
>
>
>
> On 3/17/2010 5:16 PM, Karl Brand wrote:
>> Thank you Wolfgang, Madelaine,
>>
>> I'd rather not reinvent the wheel if i can help it.
>>
>> And if you'll humor me a little longer, perhaps you can ensure I do
>> what you suggest correctly for my exact application.
>>
>> The overlaps I have are between 6 datasets. The experiment consisted of
>> a treatment (Pperiod) with 3 levels (S, E & L) applied to 2 tissues (R &
>> C). FYI, the targets file is below if it helps. Each of the 6 datasets
>> contains 16 time points, which I interrogated for transcripts that fit a
>> sine curve and several other criteria, thus defining a list of 'rhythmic
>> genes' for each of the 6 datasets.
>>
>> So an obvious question is what rhythmic transcripts are common between
>> various combinations of the 6 data sets. The combinations are:
>>
>> Venn 1: Overlapping the 3 datasets of the 3 levels of treatment for
>> tissue 'R'
>> Venn 2: As above for tissue 'C'
>> Venn 3: Overlapping 'R' and 'C' for treatment level 1 only.
>> Venn 4: As for 3. for treatment level 2 only.
>> Venn 5: As for 3. for treatment level 3 only.
>>
>> So what I meant by "non-independent gene lists" I think might apply to
>> Venn 3, 4 and 5, given that tissues 'R' & 'C' are obtained from
>> the same animals, albeit 16 of them, and as time courses. But still,
>> they cannot, strictly speaking, be considered independent, right? Which I
>> thought some tests, including Fisher's, depend on.
>>
>> Knowing this, would you think the phyper function is the right one for
>> my purpose? If so, I'll plough on with at least the confidence that
>> someone with a lot more experience on this than me recommends it!
>>
>> Again my thanks for engaging my query,
>>
>> Karl
>>
>>
>> "RNA_Targets.txt"-
>>
>> FileName Tissue Pperiod Time Animal
>> 01file.CEL R S 1 1
>> 02file.CEL C S 1 1
>> 03file.CEL R S 2 2
>> 04file.CEL C S 2 2
>> 05file.CEL R S 3 3
>> 06file.CEL C S 3 3
>> 07file.CEL R S 4 4
>> 08file.CEL C S 4 4
>> 09file.CEL R S 5 5
>> 10file.CEL C S 5 5
>> 11file.CEL R S 6 6
>> 12file.CEL C S 6 6
>> 13file.CEL R S 7 7
>> 14file.CEL C S 7 7
>> 15file.CEL R S 8 8
>> 16file.CEL C S 8 8
>> 17file.CEL R S 9 9
>> 18file.CEL C S 9 9
>> 19file.CEL R S 10 10
>> 20file.CEL C S 10 10
>> 21file.CEL R S 11 11
>> 22file.CEL C S 11 11
>> 23file.CEL R S 12 12
>> 24file.CEL C S 12 12
>> 25file.CEL R S 13 13
>> 26file.CEL C S 13 13
>> 27file.CEL R S 14 14
>> 28file.CEL C S 14 14
>> 29file.CEL R S 15 15
>> 30file.CEL C S 15 15
>> 31file.CEL R S 16 16
>> 32file.CEL C S 16 16
>> 33file.CEL R E 1 17
>> 34file.CEL C E 1 17
>> 35file.CEL R E 2 18
>> 36file.CEL C E 2 18
>> 37file.CEL R E 3 19
>> 38file.CEL C E 3 19
>> 39file.CEL R E 4 20
>> 40file.CEL C E 4 20
>> 41file.CEL R E 5 21
>> 42file.CEL C E 5 21
>> 43file.CEL R E 6 22
>> 44file.CEL C E 6 22
>> 45file.CEL R E 7 23
>> 46file.CEL C E 7 23
>> 47file.CEL R E 8 24
>> 48file.CEL C E 8 24
>> 49file.CEL R E 9 25
>> 50file.CEL C E 9 25
>> 51file.CEL R E 10 26
>> 52file.CEL C E 10 26
>> 53file.CEL R E 11 27
>> 54file.CEL C E 11 27
>> 55file.CEL R E 12 28
>> 56file.CEL C E 12 28
>> 57file.CEL R E 13 29
>> 58file.CEL C E 13 29
>> 59file.CEL R E 14 30
>> 60file.CEL C E 14 30
>> 61file.CEL R E 15 31
>> 62file.CEL C E 15 31
>> 63file.CEL R E 16 32
>> 64file.CEL C E 16 32
>> 65file.CEL R L 1 33
>> 66file.CEL C L 1 33
>> 67file.CEL R L 2 34
>> 68file.CEL C L 2 34
>> 69file.CEL R L 3 35
>> 70file.CEL C L 3 35
>> 71file.CEL R L 4 36
>> 72file.CEL C L 4 36
>> 73file.CEL R L 5 37
>> 74file.CEL C L 5 37
>> 75file.CEL R L 6 38
>> 76file.CEL C L 6 38
>> 77file.CEL R L 7 39
>> 78file.CEL C L 7 39
>> 79file.CEL R L 8 40
>> 80file.CEL C L 8 40
>> 81file.CEL R L 9 41
>> 82file.CEL C L 9 41
>> 83file.CEL R L 10 42
>> 84file.CEL C L 10 42
>> 85file.CEL R L 11 43
>> 86file.CEL C L 11 43
>> 87file.CEL R L 12 44
>> 88file.CEL C L 12 44
>> 89file.CEL R L 13 45
>> 90file.CEL C L 13 45
>> 91file.CEL R L 14 46
>> 92file.CEL C L 14 46
>> 93file.CEL R L 15 47
>> 94file.CEL C L 15 47
>> 95file.CEL R L 16 48
>> 96file.CEL C L 16 48
>>
>>
>>
>>
>>
>> On 3/17/2010 4:16 PM, Wolfgang Huber wrote:
>>> Dear Karl
>>>
>>> [reposting to list]
>>>
>>> The bioinformatician was quicker, and provided a hack that "works", but
>>> a statistician might have pointed out that the simulation scheme you
>>> propose below is a needlessly poor and slow approximation of what the
>>> hypergeometric distribution or the Fisher test would do faster and more
>>> exactly.
>>>
>>> "Poor" because the distribution of count variables is (typically and in
>>> particular in your case) not symmetric and using a standard deviation to
>>> define a confidence interval and significance thresholds would ignore
>>> that - i.e. give suboptimal results.
>>>
>>> Don't get me wrong - I think it's great when people are able to
>>> reinvent the wheel, but to get stuff done, using existing wheel designs
>>> tends to be more productive.
>>>
>>> PS I am not sure what you mean by "non-independent gene lists". If you
>>> already know that the lists are dependent, what exactly do you gain by
>>> showing that their overlap is higher than if they were independent?
>>> Isn't that tautological?
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>>
>>> Karl Brand scripsit 17/03/10 15:45:
>>>> Cheers Wolfgang,
>>>>
>>>> Unfortunately, waiting on my local statistician also takes longer than
>>>> using the calculator :(
>>>>
>>>> Discussion with a much more responsive bioinformatician yielded the
>>>> plan to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
>>>>
>>>> By using the same numbers of chip-background probes (c. 45,000)
>>>> and short-listed probes of interest (c. 500), randomly selecting
>>>> and checking the overlap, say 10,000 times, an estimate of
>>>> chance overlap could be obtained, along with a standard deviation
>>>> against which I could compare my actual results for an estimate of
>>>> significance, or p-value.
>>>>
>>>> Correct me if we're wrong but this seemed acceptable for Venns of
>>>> non-independent gene lists.
>>>>
>>>> Coding this was what I was appealing for help on since my experience
>>>> here is limited. But I'm definitely up for a crack at it. I'll start
>>>> by having a look at the "stats" package phyper.
>>>>
>>>> Again with appreciation for your prompt, thoughtful response,
>>>>
>>>> Karl
>>>>
>>>> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>>>>> Dear Karl,
>>>>>
>>>>> I don't think what you need here is necessarily a package - the
>>>>> required computations, if possible, are one or a few lines of R using
>>>>> standard functions e.g. in the "stats" package such as phyper.
>>>>>
>>>>> Perhaps the more important thing to do is to precisely define the
>>>>> questions you want to be asking. For this, discussion with a local
>>>>> statistician might be helpful. Once you have that, the answer will
>>>>> probably be fairly obvious from a basic text book on combinatorics
>>>>> (probability theory on discrete variables).
>>>>>
>>>>> Best wishes
>>>>> Wolfgang
>>>>>
>>>>>
>>>>> Karl Brand scripsit 17/03/10 12:26:
>>>>>> Dear BioCers,
>>>>>>
>>>>>> I've got six lists of genes, and I'm focused on the overlaps
>>>>>> between them.
>>>>>>
>>>>>> What i'm searching for is a package or code to quantify the
>>>>>> significance of the overlap between both a pair of gene lists, and
>>>>>> also between three gene-lists. Six might be interesting, but not
>>>>>> necessary.
>>>>>>
>>>>>> Specifically, what overlap would be expected by chance, and how
>>>>>> many standard deviations is my actual overlap from the estimated
>>>>>> chance overlap?
>>>>>>
>>>>>> Whilst some of my lists are independent, others are not, being
>>>>>> derived from tissues of the same origin. I understand this would
>>>>>> exclude tests like Fisher's exact test, which assume independence.
>>>>>>
>>>>>> By using the same numbers of chip-background probes and short-listed
>>>>>> probes of interest, randomly selected and checking the overlap,
>>>>>> performed say 10,000 times, i think i could obtain the estimates i'm
>>>>>> looking for in a 'statistically acceptable' manner.
>>>>>>
>>>>>> Does anyone know of a package or code written for this purpose? I
>>>>>> failed to find anything in BioConductor or in the BioC lists. As
>>>>>> simple as coding it no doubt is, my lack of R knowledge would make
>>>>>> doing it with a calculator the faster option :)
>>>>>>
>>>>>> Look forward to any recommendations or suggestions with appreciation,
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>

-- 
Karl Brand k.brand-asperand-erasmusmc.nl
Department of Genetics
Erasmus MC
Dr Molewaterplein 50
3015 GE Rotterdam
lab +31 (0)10 704 3409 fax +31 (0)10 704 4743 mob +31 (0)642 777 268


