[R] Multivariate hypergeometric distribution version of phyper()
On Tue, 30 Mar 2010, Karl Brand wrote:
Something of note, though, that you may have further thoughts on: phyper() was
*specifically* recommended by BioC responders for my application, in spite of
the fact that I originally thought a bootstrapping approach seemed most logical
given a quasi dependency* between gene lists. I implore you to have a
thorough read through my recent BioC thread here:
http://permalink.gmane.org/gmane.science.biology.informatics.conductor/27909
*Probably gene expression has varying levels of dependency, but at least for
comparing the 3 lists I can say they all come from independent biological
replicates (animals), which in theory doesn't violate any of phyper's
assumptions. Right?
Wrong. The 'independent' in 'independent experiments' is not the same as
in 'independent observations' (which refers to stochastic independence).
And in a gene expression study, genes typically act in coordination. In
fact, my comment that "the hypergeometric gives results that are
astonishingly anticonservative" was informed by at least one experience in
which the hypergeometric gave p-values many orders of magnitude smaller
than a test based on more reasonable assumptions.
Chuck
p.s. the 'block bootstrap' I referred to differs from the standard,
plug-in bootstrap. Standard bootstrap samples often do not properly
mirror the data generating process in genomic contexts.
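
To make the distinction in the p.s. concrete, here is a rough sketch contrasting a standard bootstrap resample with a moving-block resample of a vector whose neighbouring entries are dependent. The helper names block_resample() and n_switches(), the block length, and the toy data are all assumptions chosen for illustration; this is not a reconstruction of any particular published procedure.

```r
# Moving-block resampling sketch (helper name and block length are assumptions)
block_resample <- function(x, block_len) {
  n <- length(x)
  n_blocks <- ceiling(n / block_len)
  # Draw block start positions with replacement
  starts <- sample.int(n - block_len + 1, n_blocks, replace = TRUE)
  # Expand each start into a run of consecutive indices
  idx <- unlist(lapply(starts, function(s) seq(s, length.out = block_len)))
  x[idx[seq_len(n)]]  # trim to the original length
}

set.seed(1)
# Toy 0/1 vector with clustered values (runs of length 5), standing in for
# genes ordered so that neighbours tend to behave alike
x <- rep(rbinom(200, 1, 0.2), each = 5)

plain   <- sample(x, replace = TRUE)          # standard bootstrap: breaks up runs
blocked <- block_resample(x, block_len = 25)  # keeps runs intact within blocks

# Number of 0/1 switches: a crude measure of how much clustering survives
n_switches <- function(v) sum(diff(v) != 0)
print(c(original = n_switches(x),
        plain    = n_switches(plain),
        blocked  = n_switches(blocked)))
```

The plain resample treats every element as exchangeable, so the run structure is lost; the blocked resample preserves it except at block joins.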
I won't go into my 2 gene-list comparisons, which are
between 'paired' tissues each derived from the same animals. They probably
cannot be considered truly independent...
On 3/30/2010 7:04 PM, Peter Ehlers wrote:
If you do still want to compute such probabilities 'by hand',
you could consider the lchoose() function which does work
for your example.
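
A minimal sketch of that suggestion, using the N, m, n and k values from the example quoted in this thread: choose(N, n) overflows to Inf, so do the arithmetic on the log scale with lchoose() and exponentiate only at the end. The variable names log_prK and prK are illustrative, and the comparison with dhyper() is included only as a sanity check.

```r
N <- 45101  # total number of balls in urn
m <- 720    # 'white' balls in urn
n <- 801    # balls drawn
k <- 40     # 'white' balls drawn

# The binomial coefficients overflow, but their logarithms are ordinary numbers
log_prK <- lchoose(m, k) + lchoose(N - m, n - k) - lchoose(N, n)
prK <- exp(log_prK)
print(prK)

# Sanity check against the built-in density. Note the argument order of
# dhyper(x, m, n, k): x = white drawn, m = white in urn,
# n = black in urn, k = number of balls drawn
print(all.equal(prK, dhyper(k, m, N - m, n)))
```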
On 2010-03-30 9:55, Charles C. Berry wrote:
[...] number of genes overlapping between 2 different lists of genes is due
to chance. This appears to work appropriately.

Now I want to try this with 3 lists of genes which phyper() does not [...]

Some googling suggests I can utilize the multivariate hypergeometric
distribution to achieve this, e.g.: [...]

But when I try to do this manually using the choose() function (see [...]

Searching CRAN archives for "Multivariate hypergeometric" show this [...]
unable to make sense of these package functions in the context
of my aforementioned application.

Can someone suggest a function, script or method to achieve my goal
of estimating the likelihood of overlap between 3 lists of genes,
ideally using the multivariate hypergeometric, or anything else for
that matter?
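
For reference, the multivariate hypergeometric probability mass function is easy to write on the log scale. The sketch below is an illustration only: the helper name dmvhyper_sketch() and the toy urn are assumptions, not functions from any CRAN package, and it says nothing about whether this distribution is an appropriate model for overlapping gene lists.

```r
# Multivariate hypergeometric pmf: an urn holds K[i] balls of colour i;
# sum(x) balls are drawn without replacement; x[i] is the count of colour i.
# P(x) = prod(choose(K[i], x[i])) / choose(sum(K), sum(x))
dmvhyper_sketch <- function(x, K) {
  stopifnot(length(x) == length(K), all(x >= 0), all(x <= K))
  # Work with log binomial coefficients to avoid overflow
  exp(sum(lchoose(K, x)) - lchoose(sum(K), sum(x)))
}

# Toy example: 5 red, 10 green, 15 blue; draw 6 and ask for 2 of each
p_toy <- dmvhyper_sketch(x = c(2, 2, 2), K = c(5, 10, 15))
print(p_toy)

# With two colours it reduces to the ordinary hypergeometric
print(all.equal(dmvhyper_sketch(c(3, 4), c(8, 12)), dhyper(3, 8, 12, 7)))
```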

Two suggestions:

1) [...] the hypergeometric gives results that
are astonishingly anticonservative. As an alternative, the
block bootstrap may be suitable. See
http://171.66.122.45/cgi/content/abstract/17/6/760
and Google (scholar) 'genomic "block bootstrap"' for some
starting points.

2) Take this thread to the bioconductor list. You are much
more likely to get pointers to useful packages and functions
for genomic statistical software there.

#example attempt with two gene lists m & n
N <- 45101 # total number balls in urn
m <- 720 # number of 'white' or 'special' balls in urn, aka 'success'
n <- 801 # number balls drawn or number of samples
k <- 40 # number of 'white' or 'special' balls DRAWN
a <- choose(m,k)
b <- choose((N-m),(n-k))
z <- choose(N,n)
prK <- (a*b)/z #'the answer'
print(prK)
[1] NaN
> a
> b
[1] Inf
> z
[1] Inf
