[BioC] HyperGtest interpretation

James W. MacDonald jmacdon at uw.edu
Wed Jun 12 16:37:10 CEST 2013


Hi Maria,

On 6/12/2013 9:37 AM, Maria [guest] wrote:
> Dear all,
>
> I know this is a silly quation but I am having trouble interpreting the table of the hyper geometric test result.

I wouldn't say it is a silly question (or quation, for that matter ;-D).

>
> I know that the p-value is the significance value that the obtained go term is not by chance, but I don't know what the expcount and odds ratio mean.

The ExpCount is the expected count of genes with the given GO term under 
the null distribution. The goal of the test is to find GO terms that are 
'enriched' in your set of significant genes. In practice what this means 
is that we are looking for GO terms for which there are more genes (of 
that type) in your set of significant genes than we would expect.

In each row there are three columns that give counts. The 'Count' column 
is the count of genes that are annotated to that GO ID in your set of 
significant genes. The 'Size' column is the number of such genes that 
are on the array, and the ExpCount column gives the expected number of 
such genes if there were no enrichment.

As an example, let's say there are 200 significant genes, and 20,000 
genes on the array, and there are 500 genes on the array that are 
annotated to GO:0000001. The ExpCount is the expected number of genes 
annotated to GO:0000001 if we were to randomly select 200 genes from the 
20,000 on the array, which here is 200 x 500 / 20,000 = 5. If you get 
much more or less than the expected number, then this is not likely to 
arise by chance, so we assume that it occurred because the set of 200 
genes you selected is 'enriched' for that GO term.
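
To make the arithmetic concrete, here is a minimal Python sketch of the 
example above. It is not the code the package runs; the function names 
are made up for illustration, and the upper-tail probability shown 
(chance of seeing at least the observed count) is just one common way to 
turn the comparison into a p-value.

```python
from math import comb

def expected_count(n_selected, n_annotated, n_universe):
    """Expected number of annotated genes in a random draw of n_selected genes."""
    return n_selected * n_annotated / n_universe

def upper_tail_p(count, n_selected, n_annotated, n_universe):
    """P(X >= count) when n_selected genes are drawn without replacement
    from n_universe genes, n_annotated of which carry the GO term."""
    total = comb(n_universe, n_selected)
    k_max = min(n_selected, n_annotated)
    return sum(
        # hypergeometric probability of exactly k annotated genes in the draw
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_selected - k) / total
        for k in range(count, k_max + 1)
    )

# Numbers from the example: 200 significant genes, 500 annotated, 20,000 on the array.
exp_count = expected_count(200, 500, 20000)
# An observed Count of 15 is a hypothetical value chosen only for illustration.
p_value = upper_tail_p(15, 200, 500, 20000)
print(exp_count, p_value)
```

With the numbers above, expected_count(200, 500, 20000) gives 5, so an 
observed Count well above 5 yields a small upper-tail probability.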

The odds ratio isn't IMO that helpful in this context. The general 
interpretation of an odds ratio is that we are comparing the odds of 
something happening to one group as compared to another. In 
epidemiological studies this is a reasonable thing to compute. As an 
example, you could look at smokers and non-smokers and count up the 
number of each that got lung cancer. If you then compute the odds ratio, 
you calculate the odds of getting lung cancer if you are a smoker versus 
the odds if you are not a smoker (and oddly enough, the odds are higher 
for smokers - whodathunk?).

In this context, the thing that occurs (like getting cancer in the 
example above) is that a gene is selected as being significant. So the 
odds ratio gives the odds of being selected given that a gene is 
annotated to GO:0000001 as compared to the odds of being selected given 
that a gene is NOT annotated to GO:0000001. Which IMO doesn't have an 
intuitive interpretation in this context.
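
For what it's worth, the comparison described above can be sketched as a 
plain 2x2-table odds ratio in Python. This is an assumption about the 
general idea only: the function name is hypothetical, and the package 
may well use a different estimator, so its OddsRatio column need not 
match this number exactly.

```python
def sample_odds_ratio(count, n_annotated, n_selected, n_universe):
    """Odds of being selected for annotated genes, divided by the odds of
    being selected for genes that are not annotated."""
    a = count                      # selected, annotated
    b = n_annotated - count        # not selected, annotated
    c = n_selected - count         # selected, not annotated
    d = n_universe - a - b - c     # not selected, not annotated
    return (a / b) / (c / d)

# Same setup as the earlier example; a Count of 15 is hypothetical.
print(sample_odds_ratio(15, 500, 200, 20000))
# When Count equals the expected count (5 here), the two odds are equal.
print(sample_odds_ratio(5, 500, 200, 20000))
```

An odds ratio above 1 means annotated genes were selected more often 
than the rest; a value of 1 corresponds to no enrichment.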

Best,

Jim

>
> Thank you
>
> Maria
>
>   -- output of sessionInfo():
>
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099


