[R] Multivariate hypergeometric distribution version of phyper()

Charles C. Berry cberry at tajo.ucsd.edu
Tue Mar 30 23:38:14 CEST 2010


On Tue, 30 Mar 2010, Karl Brand wrote:

> Peter, Chuck,
>
> Big thanks for your input.
>
> I will be following up each and every of your suggestions on the morrow.
>
> Something of note though that you may have further thoughts on- phyper() was 
> *specifically* recommneded by BioC responders for my application in spite of 
> the fact i originally thought a bootstrapping approach seemed most logical 
> given a quasi dependency* between gene lists. I implore you to have a 
> thorough read through my recent BioC thread here:
>
> http://permalink.gmane.org/gmane.science.biology.informatics.conductor/27909
>
> After more than a week since posting on the BioC without response, i reposted 
> here the question you now kindly addressed. Clearly i'm yet to resolve this 
> by following up your suggestions, and any response you may offer on the BioC 
> thread will no doubt be enlightening for me.
>
> Thanks again,
>
> Karl
>
> *Probably gene expression has varying levels of dependancy, but atleast for 
> comparing the 3 lists i can say they all come from independent biological 
> replicates (animals) which in theory doesn't violate any of phypers 
> assumptions. Right?

Wrong. The 'independent' in 'independent experiments' is not the same as 
in 'independent observations' (which refers to stochastic independence).

And in a gene expression study, genes typically act in coordination. In 
fact, my comment that "the hypergeometric gives results that are 
astonishingly anticonservative" was informed by at least one experience in 
which the hypergeometric gave p-values many orders of magnitude smaller 
than a test based on more reasonable assumptions.

Chuck

p.s. the 'block bootstrap' I referred to differs from the standard, 
plug-in bootstrap. Standard bootstrap samples often do not properly 
mirror the data generating process in genomic contexts.

I won't go into my 2 gene-list comparisons which are 
> between 'paired' tissues each derived from the same animals. They probably 
> can not be considered truly independant...
>
>
>
> On 3/30/2010 7:04 PM, Peter Ehlers wrote:
>>  Karl,
>>
>>  I strongly support Chuck's recommendations.
>>  If you do still want to compute such probabilities 'by hand',
>>  you could consider the lchoose() function which does work
>>  for your example.
>>
>>  -Peter Ehlers
>>
>>  On 2010-03-30 9:55, Charles C. Berry wrote:
>> >  On Tue, 30 Mar 2010, Karl Brand wrote:
>> > 
>> > >  Dear R Users,
>> > > 
>> > >  I employed the phyper() function to estimate the likelihood that the
>> > >  number of genes overlapping between 2 different lists of genes is due
>> > >  to chance. This appears to work appropriately.
>> > > 
>> > >  Now i want to try this with 3 lists of genes which phyper() does not
>> > >  appear to support.
>> > > 
>> > >  Some googling suggests i can utilize the Multivariate hypergeometric
>> > >  distribution to achieve this. eg.:
>> > > 
>> > >  http://en.wikipedia.org/wiki/Hypergeometric_distribution
>> > > 
>> > >  But when i try to do this manually using the choose() function (see
>> > >  attempt below example with just two gene lists) i'm unable to perform
>> > >  the calculations- the numbers hit infinity before getting an answer.
>> > > 
>> > >  Searching cran archives for "Multivariate hypergeometric" show this
>> > >  term in the vignettes of package's ‘combinat’ and ‘forward’. But i'm
>> > >  unable to make sense of the these pachakege functions in the context
>> > >  of my aforementioned apllication.
>> > > 
>> > >  Can some one suggest a function, script or method to achieve my goal
>> > >  of estimating the likelyhood of overlap between 3 lists of genes,
>> > >  ideally using the multivariate hypergeometric, or anything else for
>> > >  that matter?
>> > 
>> > 
>> >  Two suggestions:
>> > 
>> >  1) Don't! Likely the theory is unsuited for the application. In
>> >  most applications that generate lists of genes, the genes are
>> >  not iid realizations and the hypergeometric gives results that
>> >  are astonishingly anticonservative. As an alternative , the
>> >  block bootstrap may be suitable. See
>> >  http://171.66.122.45/cgi/content/abstract/17/6/760
>> > 
>> >  and Google (scholar) 'genomic "block bootstrap"' for some
>> >  starting points.
>> > 
>> > 
>> >  2) Take this thread to the bioconductor list. You are much
>> >  more likely to get pointers to useful packages and functions
>> >  for genomic statistical software there.
>> > 
>> >  HTH,
>> > 
>> >  Chuck
>> > 
>> > 
>> > > 
>> > >  cheers in advance,
>> > > 
>> > >  Karl
>> > > 
>> > > 
>> > > 
>> > >  #example attempt with two gene lists m & n
>> > >  N <- 45101 # total number balls in urn
>> > >  m <- 720 # number of 'white' or 'special' balls in urn, aka 'success'
>> > >  n <- 801 # number balls drawn or number of samples
>> > >  k <- 40 # number of 'white' or 'special' balls DRAWN
>> > > 
>> > >  a <- choose(m,k)
>> > >  b <- choose((N-m),(n-k))
>> > >  z <- choose(N,n)
>> > >  prK <- (a*b)/z #'the answer'
>> > >  print(prK)
>> > >  [1] NaN
>> > > 
>> > > >  a
>> > >  [1] 7.985852e+65
>> > > >  b
>> > >  [1] Inf
>> > > >  z
>> > >  [1] Inf
>> > > 
>> > > 
>> > >  --
>> > >  Karl Brand
>> > >  Department of Genetics
>> > >  Erasmus MC
>> > >  Dr Molewaterplein 50
>> > >  3015 GE Rotterdam
>> > >  T +31 (0)10 704 3457 | F +31 (0)10 704 4743 | M +31 (0)642 777 268
>> > > 
>> > >  ______________________________________________
>> > >  R-help at r-project.org mailing list
>> > >  https://stat.ethz.ch/mailman/listinfo/r-help
>> > >  PLEASE do read the posting guide
>> > >  http://www.R-project.org/posting-guide.html
>> > >  and provide commented, minimal, self-contained, reproducible code.
>> > > 
>> > 
>> >  Charles C. Berry (858) 534-2098
>> >  Dept of Family/Preventive Medicine
>> >  E mailto:cberry at tajo.ucsd.edu UC San Diego
>> >  http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 
>> >  92093-0901
>> > 
>> > 
>> > 
>> >  ______________________________________________
>> >  R-help at r-project.org mailing list
>> >  https://stat.ethz.ch/mailman/listinfo/r-help
>> >  PLEASE do read the posting guide
>> >  http://www.R-project.org/posting-guide.html
>> >  and provide commented, minimal, self-contained, reproducible code.
>> 
>
> -- 
> Karl Brand
> Department of Genetics
> Erasmus MC
> Dr Molewaterplein 50
> 3015 GE Rotterdam
> T +31 (0)10 704 3457 |F +31 (0)10 704 4743 |M +31 (0)642 777 268
>
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



More information about the R-help mailing list