[BioC] hypergeometric test in GOstats
Sebastien Gerega
seb at gerega.net
Fri Mar 13 02:41:47 CET 2009
Hi Robert,
sorry about not having included the sessionInfo(). I have included all
the required information this time - my sessionInfo() and links to the
input data can be found at the end of the email.
Here is the code that I am using:
entrezUni = read.table("entrezUni.tsv", sep="\t")[,1]
sigEntrez = read.table("sigEntrez.tsv", sep="\t")[,1]
params = new("GOHyperGParams", geneIds=as.integer(sigEntrez),
universeGeneIds = as.integer(entrezUni),
annotation="mouse4302.db", pvalueCutoff=0.05, testDirection="over",
ontology = "BP", conditional=FALSE)
hgOver = hyperGTest(params)
report = summary(hgOver)
length(entrezUni)
length(sigEntrez)
report[1:3,]
and the output:
> length(entrezUni)
[1] 1803
> length(sigEntrez)
[1] 266
> report[1:3,]
GOBPID Pvalue OddsRatio ExpCount Count Size
Term
1 GO:0006952 1.519928e-16 5.660429 18.81197 55 124 defense
response
2 GO:0050896 9.662799e-16 3.570590 50.36752 99 332 response to
stimulus
3 GO:0002376 9.810622e-16 4.164349 31.70726 74 209 immune system
process
So I am not using a conditional test and I have searched through the
mailing list and documentation for information about how ExpCount is
calculated but have not found an answer.
I would have expected it to be a simple case of
Size/universeGeneIds*geneIds
However, this is not the case:
> man = report[1:3,]$Size/length(entrezUni)*length(sigEntrez)
> report[1:3,]$ExpCount - man
[1] 0.5180113 1.3869335 0.8730997
I understand that I may be omitting some details or steps in the process
but I have searched and been unable to find any information regarding this.
Your help is very much appreciated.
regards,
Sebastien
http://sydneybioinformatics.org/download/sigEntrez.tsv
http://sydneybioinformatics.org/download/entrezUni.tsv
> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32
locale:
LC_COLLATE=English_Australia.1252;LC_CTYPE=English_Australia.1252;LC_MONETARY=English_Australia.1252;LC_NUMERIC=C;LC_TIME=English_Australia.1252
attached base packages:
[1] splines tools stats graphics grDevices utils
datasets methods base
other attached packages:
[1] GOstats_2.8.0 Category_2.8.2 genefilter_1.22.0
survival_2.34-1 RBGL_1.18.0 annotate_1.20.1
xtable_1.5-4
[8] GO.db_2.2.5 graph_1.20.0 mouse4302.db_2.2.5
RSQLite_0.7-1 DBI_0.2-4 AnnotationDbi_1.4.2
Biobase_2.2.1
loaded via a namespace (and not attached):
[1] cluster_1.11.12 GSEABase_1.4.0 XML_1.99-0
Robert Gentleman wrote:
> Hi Sebastien,
> It is expected that you do a little bit of the homework before posting.
> Some things to try:
>
> 1) please read the posting guide, you need to give us information on your
> particular version of R and Bioconductor, so that we can attempt to reproduce
> the results you have.
> 2) you need to give a reproducible example, so we can test and make sure we are
> getting the same answer as you do. In this case you did not show us even the
> call you used, so we have no way of knowing what was computed, and hence cannot
> do more than speculate - which is a waste of everyone's time.
> 3) you need to read the documentation, there are manual pages and a vignette.
> These are sometimes unclear, and/or incomplete, and letting us know what is not
> clear helps us to improve them.
> 4) you should check the mailing list archive to see if the topic has already
> been discussed (this one has come up very many times), and then you can readily
> obtain your answer.
>
> So, some speculation, I suspect that your test is conditional.
> And if I understand your example, this is easily checked by a direct call to
> fisher.test, which is different from the parts that you did report, again
> suggesting that your testing is conditional.
>
>
>> xm
>>
> [,1] [,2]
> [1,] 55 211
> [2,] 69 1468
>
>> fisher.test(xm)
>>
>
> Fisher's Exact Test for Count Data
>
> data: xm
> p-value < 2.2e-16
> alternative hypothesis: true odds ratio is not equal to 1
> 95 percent confidence interval:
> 3.701153 8.260393
> sample estimates:
> odds ratio
> 5.537531
>
> best wishes
> Robert
>
> Sebastien Gerega wrote:
>
>> Hi,
>> I would like to get a better understanding of exactly how the
>> calculations in GOstats are performed.
>> Does the package use a standard hypergeometric test?
>> If I have the following values:
>> geneIds = 266
>> universeGeneIds = 1803
>>
>> and a particular GO term has:
>> size = 124
>> count = 55
>>
>> then the associated p-value, odds ratio and expected count are
>> 1.519928e-16, 5.660429, and 18.81197 respectively.
>>
>> However, I would have thought the expected count would be 266 * 124 /
>> 1803 = 18.29. In addition, the p-value obtained from the hypergeometric
>> test using the following website http://keisan.casio.com/has10/SpecExec.cgi
>> is different.
>>
>> Are there any steps performed by the GOstats package that make it
>> different from a standard hypergeo test? What is the reason for these
>> differences?
>> thanks in advance,
>> Sebastien
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
More information about the Bioconductor
mailing list