[Rd] p.adjust(<NA>s), was "Re: [BioC] limma and p-values"
Gordon Smyth
smyth at wehi.edu.au
Sun Jan 16 09:44:26 CET 2005
The new committed version of p.adjust() contains some problems:
> p.adjust(c(0.05,0.5),method="hommel")
[1] 0.05 0.50
No adjustment!
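For two p-values Hommel's method coincides with Hochberg's, so it is easy to
check what the answer should be:

> p.adjust(c(0.05,0.5),method="hochberg")
[1] 0.1 0.5

i.e. the smaller p-value should be adjusted to 0.1, not left at 0.05.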
I can't see how the new treatment of NAs can be justified. One needs to
distinguish between NAs which represent missing p-values and NAs which
represent unknown p-values. In virtually all applications giving rise to
NAs, the NAs represent missing p-values which could not be computed because
of missing data. In such cases, the observed p-values should definitely be
adjusted as if the NAs weren't there, because NAs represent p-values which
genuinely don't exist.
I can only think of one situation in which the NAs might represent unknown
but existing p-values. This would be when a large experiment has been
conducted, leading to many p-values. Instead of inputting all the p-values to
the p.adjust() function, you decide to enter only the smallest p-values and
represent the others using NAs. The trouble with this approach is that it
can only be used with method="bonferroni". All the other adjustment methods
are step-up or step-down procedures, or involve closure, as Hommel's method does.
For these methods, you simply have to know all the p-values before you can
adjust any of them.
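(Bonferroni works from partial information because each adjusted value is
simply min(1, n*p), free of the other p-values. A sketch with made-up
numbers, using the 'n' argument of p.adjust(): if you know only the 3
smallest of n = 100 p-values,

> p.adjust(c(0.0001, 0.0003, 0.002), method="bonferroni", n=100)
[1] 0.01 0.03 0.20

gives the correct adjusted values without the other 97.)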
The other methods are a different story. For example, you're currently returning
> p.adjust(c(0.05,NA,NA,NA,NA,NA,NA,NA,NA,NA),method="fdr")
[1] 0.5 NA NA NA NA NA NA NA NA NA
But the unknown p-values might have been like this:
> p.adjust(c(0.05,0.051,0.051,0.051,0.051,0.051,0.051,0.051,0.051,0.051),
+   method="fdr")
[1] 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051
in which case the first adjusted p-value would have been 0.051 not 0.5.
How can you justify returning 0.5 for the first adjusted p-value, when the
correct value could actually be a factor of 10 lower, even in the case where
the first p-value really is the smallest of them all?
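To spell out the dependence: the BH ("fdr") adjustment is
p~_(i) = min_{j >= i} (n/j) p_(j), so the smallest adjusted p-value involves
every p-value in the vector. By hand, for the example above:

> p <- c(0.05, rep(0.051, 9))
> o <- order(p); n <- length(p)
> pmin(1, rev(cummin(rev(n/seq_len(n) * p[o]))))[order(o)]
[1] 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051 0.051

With NAs in place of the nine larger p-values, the j > 1 terms of that
minimum are simply unknown.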
At 07:29 AM 9/01/2005, Martin Maechler wrote:
>I've thought more and made experiements with R code versions
>and just now committed a new version of p.adjust() to R-devel
>--> https://svn.r-project.org/R/trunk/src/library/stats/R/p.adjust.R
>which does sensible NA handling by default and
>*additionally* has an "na.rm" argument (set to FALSE by
>default). The extended 'Examples' section on the help page
> https://svn.r-project.org/R/trunk/src/library/stats/man/p.adjust.Rd
>shows how the new NA handling is typically much more sensible
>than using "na.rm = TRUE".
>
>Martin
>
>
> >>>>> "MM" == Martin Maechler <maechler at stat.math.ethz.ch>
> >>>>> on Sat, 8 Jan 2005 17:19:23 +0100 writes:
>
> >>>>> "GS" == Gordon K Smyth <smyth at wehi.edu.au>
> >>>>> on Sat, 8 Jan 2005 01:11:30 +1100 (EST) writes:
>
> MM> <.............>
>
> GS> p.adjust() unfortunately gives incorrect results when
> GS> 'p' includes NAs. The results from topTable are
> GS> correct. topTable() takes care to remove NAs before
> GS> passing the values to p.adjust().
>
> MM> There's at least one bug in p.adjust(): The "hommel"
> MM> method currently does not work at all with NAs (and I
> MM> have an uncommitted fix ready for this bug). OTOH, the
> MM> current version of p.adjust() ``works'' with NA's, apart
> MM> from Hommel's method, but by using "n = length(p)" in
> MM> the correction formulae, i.e. *including* the NAs for
> MM> determining sample size `n' {my fix to "hommel" would do
> MM> this as well}.
>
> MM> My question is what p.adjust() should do when there are
> MM> NA's more generally, or more specifically which `n' to
> MM> use in the correction formula. Your proposal amounts to
> MM> ``drop NA's and forget about them till the very end''
> MM> (where they are wanted in the result), i.e., your sample
> MM> size `n' would be sum(!is.na(p)) instead of length(p).
My approach to NAs (in the topTable function in the limma package) is
correct in the microarray context, where the NAs represent missing
(non-existent) p-values which could not be computed.
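A minimal sketch of that handling (this is not the actual topTable code, and
pAdjustOmitNA is just a name for illustration):

## adjust only the observed p-values, so that n = sum(!is.na(p)),
## and leave the NAs in their original positions
pAdjustOmitNA <- function(p, method = "fdr") {
    adj <- p
    ok <- !is.na(p)
    adj[ok] <- p.adjust(p[ok], method = method)
    adj
}

> pAdjustOmitNA(c(0.05, NA, NA, 0.02), method = "fdr")
[1] 0.05   NA   NA 0.04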
> MM> To me it doesn't seem obvious that this setting "n =
> MM> #{non-NA observations}" is desirable for all P-value
> MM> adjustment methods. One argument for keeping ``n = #{all
> MM> observations}'' at least for some correction methods is
> MM> the following "continuity" one:
>
> MM> If only a few ``irrelevant'' (let's say > 0.5) P-values
> MM> are replaced by NA, the adjusted relevant small P-values
> MM> shouldn't change much, ideally not at all. I'm really
> MM> no scholar on this topic, but e.g. for "holm" I think I
> MM> would want to keep ``full n'' because of the above
> MM> continuity argument.
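For concreteness, I take the continuity argument to be this: with the new
code, which (if I understand correctly) keeps n = length(p) by default,

> p.adjust(c(0.001, 0.01, 0.6, 0.7), method="holm")
[1] 0.004 0.030 1.000 1.000
> p.adjust(c(0.001, 0.01, NA, NA), method="holm")
[1] 0.004 0.030    NA    NA

so NA-ing out the two large p-values leaves the small adjusted values
unchanged.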
I don't see how the treatment of NAs follows from this continuity argument.
The argument seems rather informal and doesn't obviously relate to the
concepts of power and control of FWER and FDR, on which the theory of the
adjustment methods is based.
The NAs should surely be treated in a consistent way for all the adjustment
methods.
> MM> BTW, for "fdr", I don't see a
> MM> straightforward way to achieve the desired continuity.
> MM> Of course, p.adjust() could adopt the possibility of
> MM> choosing how NA's should be treated e.g. by another
> MM> argument ``use.na = TRUE/FALSE'' and hence allow both
> MM> versions.
>
> MM> Feedback very welcome, particularly from ``P-value
> MM> experts'' ;-)
>
> MM> Martin Maechler, ETH Zurich
Gordon