[BioC] multtest MTP raw P values = 0
Shtatland, Timur
TSHTATLAND at MGH.HARVARD.EDU
Tue Mar 4 06:00:59 CET 2008
Houston,
Thank you very much for your advice!
1) Yes, I could reproduce both the problem and your solution with a
much smaller data set (see code below).
I made a toy dataset with just 1 (one) gene and 2 groups of 5 samples
each, with all values in group 1 lower than values in group
2. Bootstrap with 100 and 1000 iterations gave P=0, and with 10,000
and 100,000 gave P>0, the latter values also consistent with both
the exact permutation test and t-test. It would take a lot of cpu time
to do B >= 10000 iterations with my real data set (and zeros would
still be possible), but I think there is no other way around it.
2) I used ssmaxT procedure with augmentation to get FDR, reported as
adjp. I used this method instead of Benjamini-Hochberg ("BH" in
mt.rawp2adjp) because it requires less assumptions, although it is
"conservative". When "BH" in mt.rawp2adjp is used, the adjp become
much lower, as expected (see below). BTW, with BH, the problem I
reported (same rawp values correspond to different adjp) is gone.
I am not sure which of these 2 methods give more realistic FDR
estimates. I could not find a consensus on multiple testing correction
procedures. There are reasons not to use Benjamini-Hochberg FDR
method, see for example:
http://view.ncbi.nlm.nih.gov/pubmed/16644791
Jung SH, Jang W. Bioinformatics. 2006 Jul 15;22(14):1730-6.
http://www.urmc.rochester.edu/smd/biostat/people/faculty/AOAS102.pdf
Gordon, A., Glazko, G., Qiu, X., and Yakovlev, A. (2007). The Annals
of Applied Statistics 1:179-190.
Thanks again.
Best regards,
Timur
--
Timur Shtatland, PhD
Center for Molecular Imaging Research
Massachusetts General Hospital
149 13th Street, Room 5408
Charlestown, MA 02129
tshtatland at mgh dot harvard dot edu
############################################################
> library(multtest)
> nSam <- 5
> X <- matrix(c(1:nSam, 1:nSam+10), nrow=1)
> print(X)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 1 2 3 4 5 11 12 13 14 15
> Y <- c(rep(0,nSam), rep(1,nSam))
> print(Y)
[1] 0 0 0 0 0 1 1 1 1 1
>
> seed <- 99
> for (B in c(100, 1000, 10000, 100000)) {
+ TTBoot <- MTP(X=X, Y=Y, test = "t.twosamp.unequalvar", alternative =
"two.sided", typeone="fdr", method="ss.maxT", fdr.method="conservative", B=B,
seed=seed)
+ MTPRes <- data.frame(PRaw=TTBoot at rawp, PAdj=TTBoot at adjp)
+ cat(sprintf("%d iterations:\n", B))
+ print(format(MTPRes, digits=10, scientific=TRUE))
+ }
running bootstrap...
iteration = 100
100 iterations:
PRaw PAdj
1 0e+00 0e+00
running bootstrap...
iteration = 100 200 300 400 500 600 700 800 900 1000
1000 iterations:
PRaw PAdj
1 0e+00 0e+00
running bootstrap...
iteration = 100 200 300 400 500 600 700 800 900 1000 1100 1200 1300 1400 1500
1600 1700 1800 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100
3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400 4500 4600 4700
4800 4900 5000 5100 5200 5300 5400 5500 5600 5700 5800 5900 6000 6100 6200 6300
6400 6500 6600 6700 6800 6900 7000 7100 7200 7300 7400 7500 7600 7700 7800 7900
8000 8100 8200 8300 8400 8500 8600 8700 8800 8900 9000 9100 9200 9300 9400 9500
9600 9700 9800 9900 10000
10000 iterations:
PRaw PAdj
1 3e-04 6e-04
running bootstrap...
iteration = 100 200 300...
99900 100000
100000 iterations:
PRaw PAdj
1 1e-04 2e-04
>
> library(exactRankTests)
> perm.test(X[Y==1],X[Y==0])
2-sample Permutation Test
data: X[Y == 1] and X[Y == 0]
T = 65, p-value = 0.007937
alternative hypothesis: true mu is not equal to 0
> t.test(X[Y==1],X[Y==0])
Welch Two Sample t-test
data: X[Y == 1] and X[Y == 0]
t = 10, df = 8, p-value = 8.488e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
7.693996 12.306004
sample estimates:
mean of x mean of y
13 3
############################################################
## same rawp, added BH:
> TTMargFDR <- mt.rawp2adjp(rawp=TTBoot at rawp, proc="BH")
> MTPRes <- data.frame(PRaw=TTBoot at rawp, PAdj=TTBoot at adjp, PAdjBH=
TTMargFDR$adjp[order(TTMargFDR$index),-1])
> print(format(subset(MTPRes[order(MTPRes$PRaw, MTPRes$PAdj), ], PRaw < 2e-3),
digits=10, scientific=TRUE))
PRaw PAdj PAdjBH
202508_s_at 0e+00 0.000000000e+00 0.000000000e+00
203130_s_at 0e+00 0.000000000e+00 0.000000000e+00
206780_at 0e+00 0.000000000e+00 0.000000000e+00
218820_at 0e+00 0.000000000e+00 0.000000000e+00
205825_at 0e+00 8.000000000e-03 0.000000000e+00
203000_at 0e+00 1.400000000e-02 0.000000000e+00
204465_s_at 0e+00 1.400000000e-02 0.000000000e+00
206282_at 0e+00 1.400000000e-02 0.000000000e+00
206502_s_at 0e+00 1.400000000e-02 0.000000000e+00
208399_s_at 0e+00 1.400000000e-02 0.000000000e+00
209598_at 0e+00 1.400000000e-02 0.000000000e+00
209755_at 0e+00 1.400000000e-02 0.000000000e+00
212805_at 0e+00 1.400000000e-02 0.000000000e+00
203001_s_at 0e+00 2.400000000e-02 0.000000000e+00
204811_s_at 0e+00 2.400000000e-02 0.000000000e+00
205174_s_at 0e+00 2.400000000e-02 0.000000000e+00
...
218816_at 1e-03 1.000000000e+00 1.051067961e-02
219939_s_at 1e-03 1.000000000e+00 1.051067961e-02
221020_s_at 1e-03 1.000000000e+00 1.051067961e-02
############################################################
############################################################
On Feb 29, 2008, at 7:26 PM, Houston Gilbert wrote:
Two things:
1.) The rawp digits are exactly zero because they are calculated as follows
(or similarly):
Here - Tn are the observed test statistics, Z is the matrix of bootstrap
test statistics.
rawp <- apply(Tn < Z, 1, mean)
This means that if no observed test statistics are greater than the sample
of resampled null test statistics (marginally, within rows), the raw
p-values will be zero, not 1/B. The rawp here are very coarse estimates and
really reflect the probability of seeing a value of test statistic that
extreme given each marginal null distibution. This is a matter of taste I
guess, and might be something for to look into. If you want non-zero
p-values, I imagine the best way to increase B, since you will get more
accurate resolution in the tails of your test statistics null distribution
estimate. They may still just be zero, though.
2.) The procedure you are actually doing is an augmentation procedure which
takes the _adjp_ calculated from the ssmaxT FWER-controlling procedure and
augments them to control the FDR in one of two ways (here,
"conservative"-ly"). Thus, your final FDR adjp are the FWER ssmaxT adjusted
p-values calculated at the observed values of the test statistics (based off
of an ecdf() call performed over column maxima of the null matrix) which
then get augmented to FDR-controlling p-values. So I imagine the ssmaxT
adjp to be way more smooth. The equality of adjps in the output is also
often the byproduct of enforcing some kind of monotonicity constraint on the
p-values themselves when going over subsets of rows.
As an aside: The augmentation procedures can be conservative in practice.
You may just want to try "BH" in mt.rawp2adjp and see what you get.
Houston
Houston Gilbert
Doctoral Student in Biostatistics
University of California, Berkeley
houston at stat.berkeley.edu
www.stat.berkeley.edu/~houston
On 2/29/08, James Bullard <bullard at berkeley.edu> wrote:
a post on the bioconductor mailing list - if you are not already on it
Begin forwarded message:
From: Timur Shtatland <tshtatland at mgh.harvard.edu>
Date: February 29, 2008 10:37:28 AM PST
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] multtest MTP raw P values = 0
Dear all,
I used MTP from the multtest package to compute bootstrap based raw
and adjusted t-test P values (group 1 = 7 samples, group 2 = 3
samples). Why are some raw P values in the output exactly zero?
I could not change this by using scientific notation, with varying
number of significant digits. In the example below, 10 digits are
used; increasing this does not make a difference. I expected all P
values > 0, with the minimum P value determined by the number of
bootstrap iterations, here 1000. Hence the raw P value minimum should
be 1e-03, as indeed the lowest *positive* P values are.
Another question is: why some raw P values that are *equal* correspond
to adjusted P values that are *different*?
For example, raw P = 1e-03 corresponds to adjusted P = 8.4e-02,
8.6e-02, etc.
I could not find the answer in the MTP help page, its vignettes, FAQ,
etc. A similar, but not identical, problem was reported on this
mailing list before, without a solution:
https://stat.ethz.ch/pipermail/bioconductor/2007-December/020492.html
Thank you for your help.
Best regards,
Timur
--
Timur Shtatland, PhD
Center for Molecular Imaging Research
Massachusetts General Hospital
149 13th Street, Room 5408
Charlestown, MA 02129
tshtatland at mgh dot harvard dot edu
############################################################
B <- 1000
seed <- 99
TTBoot <- MTP(X=esetGcrmaExprsFiltered, Y=TT, test =
"t.twosamp.unequalvar", alternative = "two.sided", typeone="fdr",
method="ss.maxT", fdr.method="conservative", B=B, seed=seed)
running bootstrap...
iteration = 100 200 300 400 500 600 700 800 900 1000
print(TTBoot)
Multiple Testing Procedure
Object of class: MTP
sample size = 10
number of hypotheses = 5413
test statistics = t.twosamp.unequalvar
type I error rate = fdr
nominal level alpha = 0.05
multiple testing procedure = ss.maxT
Call: MTP(X = esetGcrmaExprsFiltered, Y = TT, test =
"t.twosamp.unequalvar",
alternative = "two.sided", typeone = "fdr", fdr.method =
"conservative",
B = B, method = "ss.maxT", seed = seed)
Slots:
Class Mode Length Dimension
statistic numeric numeric 5413
estimate numeric numeric 5413
sampsize integer numeric 1
rawp numeric numeric 5413
adjp numeric numeric 5413
conf.reg array logical 0 0,0,0
cutoff matrix logical 0 0,0
reject matrix logical 5413 5413,1
nulldist matrix numeric 5413000 5413,1000
call call call 10
seed integer numeric 1
...
MTPRes <- data.frame(PRaw=TTBoot at rawp, PAdj=TTBoot at adjp)
print(format(subset(MTPRes[order(MTPRes$PRaw, MTPRes$PAdj), ],
PAdj < 0.1), digits=10, scientific=TRUE))
PRaw PAdj
202508_s_at 0e+00 0.000000000e+00
203130_s_at 0e+00 0.000000000e+00
206780_at 0e+00 0.000000000e+00
218820_at 0e+00 0.000000000e+00
205825_at 0e+00 8.000000000e-03
203000_at 0e+00 1.400000000e-02
204465_s_at 0e+00 1.400000000e-02
206282_at 0e+00 1.400000000e-02
206502_s_at 0e+00 1.400000000e-02
208399_s_at 0e+00 1.400000000e-02
209598_at 0e+00 1.400000000e-02
209755_at 0e+00 1.400000000e-02
212805_at 0e+00 1.400000000e-02
203001_s_at 0e+00 2.400000000e-02
204811_s_at 0e+00 2.400000000e-02
205174_s_at 0e+00 2.400000000e-02
205646_s_at 0e+00 2.400000000e-02
206051_at 0e+00 2.400000000e-02
212624_s_at 0e+00 2.800000000e-02
204035_at 0e+00 3.000000000e-02
221261_x_at 0e+00 3.200000000e-02
206915_at 0e+00 3.400000000e-02
206104_at 0e+00 4.000000000e-02
210036_s_at 0e+00 5.000000000e-02
218380_at 0e+00 5.000000000e-02
213135_at 0e+00 6.200000000e-02
204870_s_at 0e+00 6.400000000e-02
218952_at 0e+00 6.400000000e-02
205120_s_at 0e+00 6.600000000e-02
211597_s_at 0e+00 6.666666667e-02
205348_s_at 0e+00 6.800000000e-02
212190_at 0e+00 7.000000000e-02
213186_at 0e+00 7.000000000e-02
203485_at 0e+00 7.200000000e-02
203572_s_at 0e+00 8.400000000e-02
209583_s_at 0e+00 8.400000000e-02
212895_s_at 0e+00 8.400000000e-02
206001_at 0e+00 8.600000000e-02
206135_at 0e+00 8.600000000e-02
212843_at 0e+00 8.600000000e-02
204059_s_at 0e+00 8.800000000e-02
210826_x_at 0e+00 9.000000000e-02
213438_at 0e+00 9.000000000e-02
218675_at 0e+00 9.000000000e-02
201116_s_at 0e+00 9.200000000e-02
203272_s_at 0e+00 9.600000000e-02
204793_at 0e+00 9.800000000e-02
204720_s_at 1e-03 8.400000000e-02
221207_s_at 1e-03 8.400000000e-02
204540_at 1e-03 8.600000000e-02
209992_at 1e-03 8.888888889e-02
210246_s_at 1e-03 9.000000000e-02
print(dim(esetGcrmaFiltered))
Features Samples
5413 10
sessionInfo()
R version 2.6.0 (2007-10-03)
powerpc-apple-darwin8.10.1
locale:
en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines tools stats graphics grDevices utils
datasets methods base
other attached packages:
[1] multtest_1.18.0 affydata_1.11.3 affyQCReport_1.16.0
geneplotter_1.16.0 lattice_0.16-5
[6] annotate_1.16.1 AnnotationDbi_1.0.6 RSQLite_0.6-4
DBI_0.2-4 RColorBrewer_1.0-2
[11] affyPLM_1.14.0 xtable_1.5-2 simpleaffy_2.14.05
gcrma_2.10.0 matchprobes_1.10.0
[16] genefilter_1.16.0 survival_2.32 affy_1.16.0
preprocessCore_1.0.0 affyio_1.6.1
[21] Biobase_1.16.1
loaded via a namespace (and not attached):
[1] KernSmooth_2.22-21 grid_2.6.0
version
_
platform powerpc-apple-darwin8.10.1
arch powerpc
os darwin8.10.1
system powerpc, darwin8.10.1
status
major 2
minor 6.0
year 2007
month 10
day 03
svn rev 43063
language R
version.string R version 2.6.0 (2007-10-03)
# platform: PowerPC G4, Mac OS 10.4.11.
The information transmitted in this electronic communica...{{dropped:16}}
More information about the Bioconductor
mailing list