[BioC] question/issue with multtest mt.maxT
Les Dethlefsen
dethlefs at stanford.edu
Mon May 30 19:17:54 CEST 2011
Hi All,
I'm a biologist, not a statistician, and a bit of a newbie with R, so please forgive me if this is a stupid question, and thanks for your help.
My data are from shotgun metagenomic sequencing of multiple samples, but it's essentially similar to microarray data. I have a matrix of 3312 functional gene categories by 24 samples, with real-valued entries indicating the proportion of interpretable gene sequence data for each sample that is assigned to each functional category. There are a number of ways to bin the samples into categories, based on some other analysis I'm most interested in a breakdown into two subjects and 5 temporal categories within each subject, i.e. a total of 10 categories. Two of the categories have 4 samples each, the remaining 8 categories have 2 samples each.
When I ask mt.maxT to carry out a 2-sided F test on the 10 categories, I see something I don't understand: the adjusted p values are not a monotonic transformation of the raw p values. They are a monotonic with respect to the test statistic...as the test statistic falls, the adjusted p values rise, as expected. But the raw p values are bouncing around quite a bit, with the lowest raw p values of 0.0001 having a range of adjusted p values, all much larger than the minimum adjusted p value. There are 244 raw p values lower than that which corresponds to the minimum adjusted p value.
My (perhaps mistaken) understanding was that permutations were used to generate the raw p values (which I definitely want, since I don't think the distributions of values for the different functional gene categories will follow any regular distribution or be similar to each other), and then some sort of adjustment was made to the raw values to adjust for multiple testing considerations, but the adjustment itself was not based on permutations. I would expect the permutations to be fairly noisy in my case with only 2 samples in most of my sample categories, but wouldn't that only affect the reliability of the raw p values? How can the adjusted p values have a different rank order than the raw p values?
I did run the same command with the samples binned only to subject, i.e. only 2 categories of 12 subject each. While the range of p values is much lower (not surprisingly) there is still the same pattern with a monotonic relationship between the test statistic and the adjusted p values, but not between the raw p values and adjusted p values. Hence, it doesn't seem that this behavior is due to the messiness of 10 unequally sized categories for only 24 samples.
I'm including some snips of my R session below; if anyone is interested in playing with the original data there's a compressed file at
ftp://asiago.stanford.edu/multtest_issue.tgz
that has the data matrix and the class labels.
I'm running R 2.13.0 (on Mac OSX with the R.app GUI) with newly-installed BioConductor, so everything should be up to date.
Thanks for any help or insight!
Les
Les Dethlefsen
Relman Lab
Microbiology & Immunology
Stanford University
R version 2.13.0 (2011-04-13)
Copyright (C) 2011 The R Foundation for Statistical Computing
ISBN 3-900051-07-0
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
[R.app GUI 1.40 (5751) x86_64-apple-darwin9.8.0]
> ko.data.df=as.data.frame(read.table("K57.fvn.ko.txt",header=TRUE,sep="",row.names=1))
> class.label.df=as.data.frame(t(read.table("ClassLabel2.txt",header=FALSE,sep="",row.names=1)))
> class.label.df
Sub CpNon Cp12Non CpPIP Cp12PIP Cp12PELP SubCp12PIP
V2 0 0 0 0 0 0 0
V3 0 0 0 0 0 0 0
V4 0 1 1 1 1 1 1
V5 0 1 1 1 1 1 1
V6 0 0 0 2 2 2 2
V7 0 0 0 2 2 2 2
V8 0 0 0 2 2 3 2
V9 0 0 0 2 2 3 2
V10 0 1 2 1 3 4 3
V11 0 1 2 1 3 4 3
V12 0 0 0 3 4 5 4
V13 0 0 0 3 4 5 4
V14 1 0 0 0 0 0 5
V15 1 0 0 0 0 0 5
V16 1 1 1 1 1 1 6
V17 1 1 1 1 1 1 6
V18 1 0 0 2 2 2 7
V19 1 0 0 2 2 2 7
V20 1 0 0 2 2 3 7
V21 1 0 0 2 2 3 7
V22 1 1 2 1 3 4 8
V23 1 1 2 1 3 4 8
V24 1 0 0 3 4 5 9
V25 1 0 0 3 4 5 9
> ko.SubCp12PIP.maxT.F=mt.maxT(ko.data.df,classlabel=class.label.df$SubCp12PIP,test="f",side="abs",nonpara="n")
b=100 b=200 b=300 b=400 b=500 b=600 b=700 b=800 b=900 b=1000
b=1100 b=1200 b=1300 b=1400 b=1500 b=1600 b=1700 b=1800 b=1900 b=2000
b=2100 b=2200 b=2300 b=2400 b=2500 b=2600 b=2700 b=2800 b=2900 b=3000
b=3100 b=3200 b=3300 b=3400 b=3500 b=3600 b=3700 b=3800 b=3900 b=4000
b=4100 b=4200 b=4300 b=4400 b=4500 b=4600 b=4700 b=4800 b=4900 b=5000
b=5100 b=5200 b=5300 b=5400 b=5500 b=5600 b=5700 b=5800 b=5900 b=6000
b=6100 b=6200 b=6300 b=6400 b=6500 b=6600 b=6700 b=6800 b=6900 b=7000
b=7100 b=7200 b=7300 b=7400 b=7500 b=7600 b=7700 b=7800 b=7900 b=8000
b=8100 b=8200 b=8300 b=8400 b=8500 b=8600 b=8700 b=8800 b=8900 b=9000
b=9100 b=9200 b=9300 b=9400 b=9500 b=9600 b=9700 b=9800 b=9900 b=10000
> ko.SubCp12PIP.maxT.F
index teststat rawp adjp
K05394 1975 4.536904e+03 0.0278000 0.0896
K01455 648 4.746965e+02 0.0238000 0.4072
K02394 1154 2.566388e+02 0.0243000 0.6114
K06990 2301 5.749225e+01 0.0297000 0.8393
K05795 2021 5.399053e+01 0.0157000 0.8511
K06641 2223 4.468233e+01 0.0265000 0.8778
K05916 2059 3.936887e+01 0.0296000 0.9033
K08321 2678 3.068437e+01 0.0001000 0.9339
K02074 1047 2.780696e+01 0.0044000 0.9476
K07783 2616 2.692854e+01 0.0023000 0.9512
K07318 2472 2.594598e+01 0.0297000 0.9572
K08137 2639 2.563818e+01 0.0297000 0.9596
K02479 1207 2.561986e+01 0.0297000 0.9596
K02082 1052 2.484221e+01 0.0252000 0.9614
K02771 1312 2.433006e+01 0.0265000 0.9667
K03446 1580 2.314993e+01 0.0028000 0.9694
K07214 2423 2.232653e+01 0.0001000 0.9757
K03325 1529 1.881246e+01 0.0001000 0.9850
K09124 2747 1.813063e+01 0.0002000 0.9858
K03762 1790 1.802233e+01 0.0303000 0.9866
K05791 2018 1.781049e+01 0.0309000 0.9876
K03739 1776 1.685187e+01 0.0265000 0.9894
#################### SNIP ################################
> ko.Sub.maxT.F=mt.maxT(ko.data.df,classlabel=class.label.df$Sub,test="f",side="abs",nonpara="n")
b=100 b=200 b=300 b=400 b=500 b=600 b=700 b=800 b=900 b=1000
b=1100 b=1200 b=1300 b=1400 b=1500 b=1600 b=1700 b=1800 b=1900 b=2000
b=2100 b=2200 b=2300 b=2400 b=2500 b=2600 b=2700 b=2800 b=2900 b=3000
b=3100 b=3200 b=3300 b=3400 b=3500 b=3600 b=3700 b=3800 b=3900 b=4000
b=4100 b=4200 b=4300 b=4400 b=4500 b=4600 b=4700 b=4800 b=4900 b=5000
b=5100 b=5200 b=5300 b=5400 b=5500 b=5600 b=5700 b=5800 b=5900 b=6000
b=6100 b=6200 b=6300 b=6400 b=6500 b=6600 b=6700 b=6800 b=6900 b=7000
b=7100 b=7200 b=7300 b=7400 b=7500 b=7600 b=7700 b=7800 b=7900 b=8000
b=8100 b=8200 b=8300 b=8400 b=8500 b=8600 b=8700 b=8800 b=8900 b=9000
b=9100 b=9200 b=9300 b=9400 b=9500 b=9600 b=9700 b=9800 b=9900 b=10000
> ko.Sub.maxT.F
index teststat rawp adjp
K07214 2423 4.388760e+01 0.0001 0.0017
K01812 867 3.582776e+01 0.0001 0.0072
K01209 554 3.493056e+01 0.0001 0.0087
K03111 1446 3.368798e+01 0.0001 0.0110
K03313 1519 3.286514e+01 0.0001 0.0129
K05878 2049 3.030097e+01 0.0001 0.0212
K01188 539 2.861773e+01 0.0001 0.0285
K07053 2349 2.625740e+01 0.0001 0.0451
K09705 2793 2.596417e+01 0.0001 0.0476
K07277 2451 2.584051e+01 0.0001 0.0492
K03543 1633 2.551410e+01 0.0003 0.0526
K10852 2989 2.493226e+01 0.0003 0.0594
K01191 541 2.465159e+01 0.0001 0.0634
K02238 1108 2.458557e+01 0.0003 0.0651
K03701 1751 2.365698e+01 0.0001 0.0812
K06998 2305 2.344949e+01 0.0001 0.0856
K12340 3131 2.284225e+01 0.0001 0.0983
K01678 779 2.280762e+01 0.0004 0.0996
K03771 1796 2.255402e+01 0.0006 0.1042
K03325 1529 2.252884e+01 0.0001 0.1047
K00647 262 2.246549e+01 0.0001 0.1061
K00282 135 2.227595e+01 0.0002 0.1099
K03657 1726 2.220074e+01 0.0001 0.1122
K01551 698 2.161330e+01 0.0001 0.1278
K05970 2072 2.160496e+01 0.0005 0.1281
K03453 1583 2.131683e+01 0.0003 0.1387
K08321 2678 2.118411e+01 0.0001 0.1428
K01811 866 2.092958e+01 0.0003 0.1512
K09687 2783 2.025709e+01 0.0001 0.1788
K07407 2494 1.909975e+01 0.0002 0.2376
K11741 3080 1.908456e+01 0.0001 0.2381
#################### SNIP ################################
More information about the Bioconductor
mailing list