[BioC] hypergeometric test in GOstats

Tue Mar 17 23:23:27 CET 2009

Hi Sebastien,

The source of your confusion on this is the fact that not every gene is
tagged with a GO term.  Thus, the size of both your gene universe and
your gene samples are not what you think they are.  They are actually
both smaller because only those genes which have GO annotation in some
form can reasonably be included.  After GOstats gets finished paring
down the gene universe it becomes 1404 long instead of 1803, and then it
has to only include things in your sample that are actually contained
within that "pared down" universe which results in your sample being
only 213 long instead of 266.  That is why the expected count is actually

213*124/1404 = 18.81197 instead of 18.29.

Actually, you could have even seen that this paring down had occurred by
looking at the hgOver object that your code produces:

> hgOver
Gene to GO BP  test for over-representation
1244 GO BP ids tested (131 have p < 0.05)
Selected gene set size: 213
    Gene universe size: 1404
    Annotation package: mouse4302

Hope this clarifies things!

  Marc

Sebastien Gerega wrote:
> Hi Robert,
> sorry about not having included the sessionInfo(). I have included all
> the required information this time - my sessionInfo() and links to the
> input data can be found at the end of the email.
>
>
> Here is the code that I am using:
> entrezUni = read.table("entrezUni.tsv", sep="\t")[,1]
> sigEntrez = read.table("sigEntrez.tsv", sep="\t")[,1]
> params = new("GOHyperGParams", geneIds=as.integer(sigEntrez),
> universeGeneIds = as.integer(entrezUni),
>    annotation="mouse4302.db", pvalueCutoff=0.05, testDirection="over",
> ontology = "BP", conditional=FALSE)
> hgOver = hyperGTest(params)
> report = summary(hgOver)
> length(entrezUni)
> length(sigEntrez)
> report[1:3,]
>
>
> and the output:
> > length(entrezUni)
> [1] 1803
> > length(sigEntrez)
> [1] 266
> > report[1:3,]
>      GOBPID       Pvalue OddsRatio ExpCount Count
> Size                  Term
> 1 GO:0006952 1.519928e-16  5.660429 18.81197    55  124      defense
> response
> 2 GO:0050896 9.662799e-16  3.570590 50.36752    99  332  response to
> stimulus
> 3 GO:0002376 9.810622e-16  4.164349 31.70726    74  209 immune system
> process
>
>
> So I am not using a conditional test and I have searched through the
> mailing list and documentation for information about how ExpCount is
> calculated but have not found an answer.
> I would have expected it to be a simple case of
>
> Size/universeGeneIds*geneIds
>
> However, this is not the case:
> > man = report[1:3,]$Size/length(entrezUni)*length(sigEntrez)
> > report[1:3,]$ExpCount - man
> [1] 0.5180113 1.3869335 0.8730997
>
>
> I understand that I may be omitting some details or steps in the
> process but I have searched and been unable to find any information
> regarding this.
> Your help is very much appreciated.
> regards,
> Sebastien
>
> http://sydneybioinformatics.org/download/sigEntrez.tsv
> http://sydneybioinformatics.org/download/entrezUni.tsv
>
> > sessionInfo()
> R version 2.8.1 (2008-12-22)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
>
>
> attached base packages:
> [1] splines   tools     stats     graphics  grDevices utils    
> datasets  methods   base   
> other attached packages:
> [1] GOstats_2.8.0       Category_2.8.2      genefilter_1.22.0  
> survival_2.34-1     RBGL_1.18.0         annotate_1.20.1    
> xtable_1.5-4      [8] GO.db_2.2.5         graph_1.20.0       
> mouse4302.db_2.2.5  RSQLite_0.7-1       DBI_0.2-4          
> AnnotationDbi_1.4.2 Biobase_2.2.1    
> loaded via a namespace (and not attached):
> [1] cluster_1.11.12 GSEABase_1.4.0  XML_1.99-0
>
>
> Robert Gentleman wrote:
>> Hi Sebastien,
>>   It is expected that you do a little bit of the homework before
>> posting.
>> Some things to try:
>>
>> 1) please read the posting guide, you need to give us information on
>> your
>> particular version of R and Bioconductor, so that we can attempt to
>> reproduce
>> the results you have.
>> 2) you need to give a reproducible example, so we can test and make
>> sure we are
>> getting the same answer as you do.  In this case you did not show us
>> even the
>> call you used, so we have no way of knowing what was computed, and
>> hence cannot
>> do more than speculate - which is a waste of everyone's time.
>> 3) you need to read the documentation, there are manual pages and a
>> vignette.
>> These are sometimes unclear, and/or incomplete, and letting us know
>> what is not
>> clear helps us to improve them.
>> 4) you should check the mailing list archive to see if the topic has
>> already
>> been discussed (this one has come up very many times), and then you
>> can readily
>> obtain your answer.
>>
>> So, some speculation, I suspect that your test is conditional.
>> And if I understand your example, this is easily checked by a direct
>> call to
>> fisher.test, which is different from the parts that you did report,
>> again
>> suggesting that your testing is conditional.
>>
>>  
>>> xm
>>>     
>>      [,1] [,2]
>> [1,]   55  211
>> [2,]   69 1468
>>  
>>> fisher.test(xm)
>>>     
>>
>>         Fisher's Exact Test for Count Data
>>
>> data:  xm
>> p-value < 2.2e-16
>> alternative hypothesis: true odds ratio is not equal to 1
>> 95 percent confidence interval:
>>  3.701153 8.260393
>> sample estimates:
>> odds ratio
>>   5.537531
>>
>> best wishes
>>   Robert
>>
>> Sebastien Gerega wrote:
>>  
>>> Hi,
>>> I would like to get a better understanding of exactly how the
>>> calculations in GOstats are performed.
>>> Does the package use a standard hypergeometric test?
>>> If I have the following values:
>>> geneIds = 266
>>> universeGeneIds = 1803
>>>
>>> and a particular GO term has:
>>> size = 124
>>> count = 55
>>>
>>> then the associated p-value, odds ratio and expected count are
>>> 1.519928e-16, 5.660429, and 18.81197 respectively.
>>>
>>> However, I would have thought the expected count would be 266 * 124 /
>>> 1803 = 18.29. In addition, the p-value obtained from the hypergeometric
>>> test using the following website
>>> http://keisan.casio.com/has10/SpecExec.cgi
>>> is different.
>>>
>>> Are there any steps performed by the GOstats package that make it
>>> different from a standard hypergeo test? What is the reason for these
>>> differences?
>>> thanks in advance,
>>> Sebastien
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>     
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
>