[BioC] HyperGtest interpretation
James W. MacDonald
jmacdon at uw.edu
Wed Jun 12 16:37:10 CEST 2013
Hi Maria,
On 6/12/2013 9:37 AM, Maria [guest] wrote:
> Dear all,
>
> I know this is a silly quation but I am having trouble interpreting the table of the hyper geometric test result.
I wouldn't say it is a silly question (or quation, for that matter ;-D).
>
> I know that the p-value is the significance value that the obtained go term is not by chance, but I don't know what the expcount and odds ratio mean.
The ExpCount is the expected count of genes with the given GO term under
the null distribution. The goal of the test is to find GO terms that are
'enriched' in your set of significant genes. In practice what this means
is that we are looking for GO terms for which there are more genes (of
that type) in your set of significant genes than we would expect.
In each row there are three columns that give counts. The 'Count' column
is the count of genes that are annotated to that GO ID in your set of
significant genes. The 'Size' column is the number of such genes that
are on the array, and the ExpCount column gives the expected number of
such genes if there were no enrichment.
As an example, let's say there are 200 significant genes, and 20,000
genes on the array, and there are 500 genes on the array that are
annotated to GO:0000001. The ExpCount is the expected number of genes
annotated to GO:0000001 if we were to randomly select 200 genes from the
20,000 on the array, which here is 200 x 500/20,000 = 5. If you get many
more or fewer than the expected number, then this is not likely to arise
by chance, so we assume that it occurred because the set of 200 genes
you selected is 'enriched' for that GO term.
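To make the arithmetic concrete, here is a minimal sketch in plain Python (standard library only) of the expected count and the upper-tail hypergeometric probability for the numbers above. The observed count of 15 and the function names are made up for illustration, and this is not necessarily how the package computes its table internally.

```python
from math import comb

def expected_count(n_selected, n_annotated, n_universe):
    # Expected number of annotated genes in a random draw of n_selected genes.
    return n_selected * n_annotated / n_universe

def upper_tail_p(count, n_selected, n_annotated, n_universe):
    # P(X >= count) when n_selected genes are drawn without replacement
    # from n_universe genes, n_annotated of which carry the GO term.
    total = comb(n_universe, n_selected)
    max_k = min(n_selected, n_annotated)
    favourable = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k)
        for k in range(count, max_k + 1)
    )
    return favourable / total

# Numbers from the example above; observed_count is hypothetical.
n_universe, n_annotated, n_selected = 20000, 500, 200
observed_count = 15

print("ExpCount:", expected_count(n_selected, n_annotated, n_universe))
print("P(Count >= observed):",
      upper_tail_p(observed_count, n_selected, n_annotated, n_universe))
```

The further the observed Count sits above the ExpCount, the smaller this tail probability becomes.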
The odds ratio isn't IMO that helpful in this context. The general
interpretation of an odds ratio is that we are comparing the odds of
something happening to one group as compared to another. In
epidemiological studies this is a reasonable thing to compute. As an
example, you could look at smokers and non-smokers and count up the
number of each that got lung cancer. If you then compute the odds ratio,
you calculate the odds of getting lung cancer if you are a smoker versus
the odds if you are not a smoker (and oddly enough, the odds are higher
for smokers - whodathunk?).
In this context, the thing that occurs (like getting cancer in the
example above) is that a gene is selected as being significant. So the
odds ratio compares the odds of being selected given that a gene is
annotated to GO:0000001 with the odds of being selected given that a
gene is NOT annotated to GO:0000001, which IMO doesn't have an intuitive
interpretation in this context.
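For what it's worth, the comparison can be written out as a 2x2 table. The sketch below is plain Python and uses the example numbers from earlier (200 selected genes, 500 annotated genes, 20,000 genes on the array) with a made-up observed count of 15. It computes the simple sample odds ratio and assumes no cell of the table is zero; the value reported in the results table may be estimated differently.

```python
def odds_ratio(count, n_selected, n_annotated, n_universe):
    # Cells of the 2x2 table (annotated or not, selected or not).
    sel_ann = count                            # selected, annotated
    unsel_ann = n_annotated - count            # not selected, annotated
    sel_other = n_selected - count             # selected, not annotated
    unsel_other = n_universe - n_annotated - sel_other  # neither
    # Odds of selection for annotated genes divided by the odds of
    # selection for all other genes, written as a cross-product.
    return (sel_ann * unsel_other) / (unsel_ann * sel_other)

# observed_count is hypothetical; the other numbers are from the example.
observed_count = 15
print("Odds ratio:", odds_ratio(observed_count, 200, 500, 20000))

# When the observed count equals the expected count (5 here), annotated
# and other genes have the same odds of selection, so the ratio is 1.
print("Odds ratio at ExpCount:", odds_ratio(5, 200, 500, 20000))
```

An odds ratio above 1 goes with a Count above the ExpCount, so it points in the same direction as the comparison of those two columns.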
Best,
Jim
>
> Thank you
>
> Maria
>
> -- output of sessionInfo():
>
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099