[BioC] package or code to quantify the significance of the venn overlap between 2 or 3 lists of genes
Karl Brand
k.brand at erasmusmc.nl
Wed Mar 17 17:16:56 CET 2010
Thank you Wolfgang, Madelaine,
I'd rather not reinvent the wheel if I can help it.
And if you'll humor me a little longer, perhaps you can ensure I do
what you suggest correctly for my exact application.
The overlaps I have are between 6 datasets. The experiment consisted of
a treatment (Pperiod) with 3 levels (S, E & L) applied to 2 tissues (R &
C). FYI, the targets file is below if it helps. Each of the 6 datasets
contains 16 time points, which I screened for transcripts that fit a sine
curve and meet several other criteria, thus defining a list of 'rhythmic
genes' for each of the 6 datasets.
So an obvious question is which rhythmic transcripts are common between
various combinations of the 6 datasets. The combinations are:
Venn 1: Overlapping the 3 datasets of the 3 levels of treatment for
tissue 'R'
Venn 2: As above for tissue 'C'
Venn 3: Overlapping 'R' and 'C' for treatment level 1 only.
Venn 4: As for 3. for treatment level 2 only.
Venn 5: As for 3. for treatment level 3 only.
So what I meant by "non-independent gene lists" I think might apply to
Venn 3, 4 and 5, given that tissues 'R' & 'C' are obtained from
the same animals, albeit 16 of them, and as time courses. But still,
they cannot, strictly speaking, be considered independent, right? And
independence is something I thought some tests, including Fisher's,
depend on.
Knowing this, would you think the phyper function is the right one for
my purpose? If so, I'll plough on, with at least the confidence that
someone with a lot more experience of this than me recommends it!
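
For concreteness, here is a minimal sketch in R of the pairwise calculation under discussion. It assumes a background of 45,000 probes and two lists of 500 probes each (the approximate figures quoted further down the thread); the observed overlap of 40 is a made-up placeholder, not a real result. It covers only a two-list overlap (as in Venn 3, 4 and 5) and tests against the null that the two lists were drawn independently from the background. It does not settle whether that null is the right one for paired tissues.

```r
# Assumed, illustrative numbers (not real results):
N <- 45000   # probes on the chip (background)
m <- 500     # size of list A, e.g. rhythmic in tissue R
k <- 500     # size of list B, e.g. rhythmic in tissue C
x <- 40      # observed overlap (hypothetical placeholder)

# Overlap expected by chance if the lists were drawn independently
expected <- m * k / N   # about 5.6

# P(overlap >= x) under the hypergeometric null.
# phyper(q, ...) with lower.tail = FALSE gives P(X > q), hence x - 1.
p_hyper <- phyper(x - 1, m, N - m, k, lower.tail = FALSE)

# The same one-sided test expressed as Fisher's exact test on a 2x2 table
tab <- matrix(c(x,     m - x,
                k - x, N - m - k + x), nrow = 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The randomisation idea, for comparison: 10,000 chance overlaps.
# rhyper() draws directly from the distribution that repeated random
# sampling of the two lists would approximate.
set.seed(1)
sim <- rhyper(10000, m, N - m, k)
p_sim <- mean(sim >= x)
```

Here p_hyper and p_fisher should agree, and mean(sim) should sit close to expected. With only 10,000 draws p_sim cannot resolve anything below 1/10,000, which is one reason the exact tail probability is preferable to the simulation.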
Again my thanks for engaging my query,
Karl
"RNA_Targets.txt"-
FileName Tissue Pperiod Time Animal
01file.CEL R S 1 1
02file.CEL C S 1 1
03file.CEL R S 2 2
04file.CEL C S 2 2
05file.CEL R S 3 3
06file.CEL C S 3 3
07file.CEL R S 4 4
08file.CEL C S 4 4
09file.CEL R S 5 5
10file.CEL C S 5 5
11file.CEL R S 6 6
12file.CEL C S 6 6
13file.CEL R S 7 7
14file.CEL C S 7 7
15file.CEL R S 8 8
16file.CEL C S 8 8
17file.CEL R S 9 9
18file.CEL C S 9 9
19file.CEL R S 10 10
20file.CEL C S 10 10
21file.CEL R S 11 11
22file.CEL C S 11 11
23file.CEL R S 12 12
24file.CEL C S 12 12
25file.CEL R S 13 13
26file.CEL C S 13 13
27file.CEL R S 14 14
28file.CEL C S 14 14
29file.CEL R S 15 15
30file.CEL C S 15 15
31file.CEL R S 16 16
32file.CEL C S 16 16
33file.CEL R E 1 17
34file.CEL C E 1 17
35file.CEL R E 2 18
36file.CEL C E 2 18
37file.CEL R E 3 19
38file.CEL C E 3 19
39file.CEL R E 4 20
40file.CEL C E 4 20
41file.CEL R E 5 21
42file.CEL C E 5 21
43file.CEL R E 6 22
44file.CEL C E 6 22
45file.CEL R E 7 23
46file.CEL C E 7 23
47file.CEL R E 8 24
48file.CEL C E 8 24
49file.CEL R E 9 25
50file.CEL C E 9 25
51file.CEL R E 10 26
52file.CEL C E 10 26
53file.CEL R E 11 27
54file.CEL C E 11 27
55file.CEL R E 12 28
56file.CEL C E 12 28
57file.CEL R E 13 29
58file.CEL C E 13 29
59file.CEL R E 14 30
60file.CEL C E 14 30
61file.CEL R E 15 31
62file.CEL C E 15 31
63file.CEL R E 16 32
64file.CEL C E 16 32
65file.CEL R L 1 33
66file.CEL C L 1 33
67file.CEL R L 2 34
68file.CEL C L 2 34
69file.CEL R L 3 35
70file.CEL C L 3 35
71file.CEL R L 4 36
72file.CEL C L 4 36
73file.CEL R L 5 37
74file.CEL C L 5 37
75file.CEL R L 6 38
76file.CEL C L 6 38
77file.CEL R L 7 39
78file.CEL C L 7 39
79file.CEL R L 8 40
80file.CEL C L 8 40
81file.CEL R L 9 41
82file.CEL C L 9 41
83file.CEL R L 10 42
84file.CEL C L 10 42
85file.CEL R L 11 43
86file.CEL C L 11 43
87file.CEL R L 12 44
88file.CEL C L 12 44
89file.CEL R L 13 45
90file.CEL C L 13 45
91file.CEL R L 14 46
92file.CEL C L 14 46
93file.CEL R L 15 47
94file.CEL C L 15 47
95file.CEL R L 16 48
96file.CEL C L 16 48
On 3/17/2010 4:16 PM, Wolfgang Huber wrote:
> Dear Karl
>
> [reposting to list]
>
> The bioinformatician was quicker, and provided a hack that "works", but
> a statistician might have pointed out that the simulation scheme you
> propose below is a needlessly poor and slow approximation of what the
> hypergeometric distribution or the Fisher test would do faster and more
> exactly.
>
> "Poor" because the distribution of count variables is (typically and in
> particular in your case) not symmetric and using a standard deviation to
> define a confidence interval and significance thresholds would ignore
> that - i.e. give suboptimal results.
>
> Don't get me wrong - I think it's great when people are capable of
> reinventing the wheel, but to get stuff done, using existing wheel designs
> tends to be more productive.
>
> PS I am not sure what you mean by "non-independent gene lists". If you
> already know that the lists are dependent, what exactly do you gain by
> showing that their overlap is higher than if they were independent?
> Isn't that tautological?
>
> Best wishes
> Wolfgang
>
>
>
> Karl Brand scripsit 17/03/10 15:45:
>> Cheers Wolfgang,
>>
>> Unfortunately, waiting on my local statistician also takes longer than
>> using the calculator :(
>>
>> Discussion with a much more responsive bioinformatician yielded the
>> plan to employ a bootstrap/randomisation (terminology?!) approach, i.e.:
>>
>> By using the same numbers of the chip-background probes (c. 45,000)
>> and my short-list of probes of interest (c. 500), randomly selected
>> and checking the overlap, performed say 10,000 times, an estimate of
>> chance overlap could be obtained, along with a standard deviation
>> against which I could compare my actual results for an estimate of
>> significance, or p-value.
>>
>> Correct me if we're wrong but this seemed acceptable for Venns of
>> non-independent gene lists.
>>
>> Coding this was what I was appealing for help on, since my experience
>> here is limited. But I'm definitely up for a crack at it. I'll start
>> by having a look at phyper in the "stats" package.
>>
>> Again with appreciation for your prompt, thoughtful response,
>>
>> Karl
>>
>> On 3/17/2010 2:48 PM, Wolfgang Huber wrote:
>>> Dear Karl,
>>>
>>> I don't think what you need here is necessarily a package - the required
>>> computations, if possible, are one or a few lines of R using standard
>>> functions e.g. in the "stats" package such as phyper.
>>>
>>> Perhaps the more important thing to do is to precisely define the
>>> questions you want to be asking. For this, discussion with a local
>>> statistician might be helpful. Once you have that, the answer will
>>> probably be fairly obvious from a basic text book on combinatorics
>>> (probability theory on discrete variables).
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>
>>> Karl Brand scripsit 17/03/10 12:26:
>>>> Dear BioCers,
>>>>
>>>> I've got six lists of genes, and I'm focused on the overlaps between them.
>>>>
>>>> What i'm searching for is a package or code to quantify the
>>>> significance of the overlap between both a pair of gene lists, and
>>>> also between three gene-lists. Six might be interesting, but not
>>>> necessary.
>>>>
>>>> Specifically, what would the overlap be expected by chance, and how
>>>> many standard deviations my actual overlap is from the estimated
>>>> chance overlap?
>>>>
>>>> Whilst some of my lists are independent, others are not, being
>>>> derived from tissues of the same origin. I understand this would
>>>> exclude tests such as Fisher's exact test, which assume independence.
>>>>
>>>> By using the same numbers of chip-background probes and short-listed
>>>> probes of interest, randomly selected and checking the overlap,
>>>> performed say 10,000 times, i think i could obtain the estimates i'm
>>>> looking for in a 'statistically acceptable' manner.
>>>>
>>>> Does anyone know of a package or code written for this purpose? I
>>>> failed to find anything in BioConductor or in the BioC lists. As
>>>> simple as coding it no doubt is, my lack of R knowledge would make
>>>> doing it with a calculator the faster option :)
>>>>
>>>> Look forward to any recommendations or suggestions with appreciation,
>>>>
>>>> Karl
>>>>
>>>>
>>>
>>>
>>
>
>
--
Karl Brand k.brand-asperand-erasmusmc.nl
Department of Genetics
Erasmus MC
Dr Molewaterplein 50
3015 GE Rotterdam
lab +31 (0)10 704 3409 fax +31 (0)10 704 4743 mob +31 (0)642 777 268