[BioC] Input for KS test
Ken Termiso
jerk_alert at hotmail.com
Fri Jul 20 19:58:28 CEST 2007
Hello,
I'd like to threshold-free stats (i.e. Kolmogorov-Smirnov) to ID overrepresented gene ontology terms..
I've got a script to do all the parsing of various GO terms, so that part is done, and I'd really prefer to use a nuts and bolts approach from the bottom up than to use something like TopGO..partly for my own educational purposes..
What I would like to do is this:
1) Nonspecific filtering to remove ~50-70% of genes that are 'absent' (Affy chips), don't change much, etc..
2) Use rank products to get two-tailed p-values for remaining genes. Rank products (papers in FEBS letters and Bioinformatics.. pubmed for 'Breitling r[au]' and you'll find 'em)
3) Use ks.test() to compare genes from each ontology category to the rest of the genes that are not represented in that ontology.. specifically, i'd input two vectors to ks.test() - one would be the p-values from rank products from each ontology, the other would be the remaining rank products p-values from all other genes that are not in the ontology.. e.g.
x <- c(0.04, 0.003, 0.19, 1.0, 0.57, 0.0001, 0, 0.89, 0.77, 0.0245) # GO category of interest p-values
y <- rnorm(6000, 0, 1) # Remaining p-values for all other retained genes after nonspec filtering (I can't type 3,000 rank products' p-values here, so I use rnorm and keep positives)
y <- subset(y, y>0) # only keep positive simulated p-values; gives ~ 3,000
ks.test(x,y)
The second and third lines of code are just to create a normal distribution of p-values here for illustration purposes; in reality I would use p-values generated by rank products..not exactly normal but probably the best that can be approximated.
I would think that using the p-values generated by rank products algorithm would deal with multiple testing, as rank products p-values are simultaneously adjusted for FDR by using permutations of sample labels to estimate distribution. This also eliminates correlated gene dependencies..
I'd also expect that using two-tailed p-values eliminates the issue of having to separate up-regulated and down-regulated genes? This issue is discussed breifly in several places, and probably most lucidly in 'How to use GOstats testing gene lists for GO term association' by Seth Falcon and Robert Gentleman [google 'GOstatsHyperG.pdf'] and there the authors speculate that using the two-tailed p-values is suboptimal. The reason the authors imply (unless I am mistaken) is because it is most desirable to find genes that go in the same direction. I dispute this point specifically in the context of gene ontology-related testing - quite often in genetics you can have a gain of function mutation in a pathway (say in mapk pathway). Say when a pathway component that normally inhibits another component gets a null mutation. Then the downstream component that was previously inhibited by the mutated gene is now upregulated, i.e. say you have
A -> B -x-> C
As a simple pathway where gene A normally induces gene B, and gene B normally inhibits gene C. If gene B is downregulated (say by null mutation), gene C is upregulated because B normally inhibits C, and that inhibition is lost by the null mutation in B. So in reality there are two genes affected in this simple pathway, but their directions are opposite (A is unchanged; B goes down; C goes up). Yet the pathway itself is profoundly changed. My point is that by keeping directionality and NOT using two-tailed p-values, you automatically limit detection of the scenario I just illustrated here.
Any comments, criticisms, advice would be much much obliged..
Thanks in advance,
-Ken
_________________________________________________________________
[[trailing spam removed]]
More information about the Bioconductor
mailing list