[R] Confidence intervals of quantiles
(Ted Harding)
ted.harding at nessie.mcc.ac.uk
Tue Feb 6 00:41:18 CET 2007
On 05-Feb-07 Mike White wrote:
> Can anyone please tell me if there is a function to calculate
> confidence intervals for the results of the quantile function.
> Some of my data is normally distributed but some is also a
> squewed distribution or a capped normal distribution. Some of
> the data sets contain about 700 values whereas others are smaller
> with about 100-150 values, so I would like to see how the confidence
> intervals change for the different distributions and different data
> sizes.
As well as the bootstrap suggestion (which may do what you want)
I'd like to suggest calcualting exact CIs for the distribution
quantiles. I don't know of an R function which does this, though
the principle is straightforward enough (I once implemented it
for octave/matlab).
I'm assuming you're after a truly distribution-free CI (as your
query suggests). In that case, the sample quantiles provide the
CI limits for the distribution quantiles, with the proviso that
you will not in general get an exact match for the confidence
level P that you desire for your confidence interval, so you
need to use P as a lower bound for the confidence level.
I'll illustrate with an equi-tailed 95% CI.
Notation:
x[r] (r = 1:n) is the r-th order statistic of a sample of n
values of a random variable X, drawn from a continuous distribution.
X[r] is the corresponding random variable.
X(p) is the p-th quantile of the distribution in question,
i.e. the value of X such that P(X <= X[p]) = p. 0 < p < 1.
Objective: You want a 95% CI (X(p)L,X(p).R) for the quantile X(p)
for given p.
Principle:
P( X[r] <= X(p) ) = pbinom(r, n, p) [*]
(i.e. the probability that you get at least r of the x's
below X(p)).
Now, for the upper limit X(p).R of the CI, you want the
smallest value of X(p) such that the probability [*] is not
less than 0.925 (i.e. 1 - (1-P)/2 = 1 - (1-0.95)/2 ).
This is achieved by X(p).R = x[s] where
s = min(which(pbinom((0:n),n,p) >= 0.025))
and, similalry, for the lower limit, X(p).L = x[r] where
r = max(which(pbinom((0:n),n,p) <= 0.025))
(Note that the "which" counts the binomial case "r=0" as number 1)
If I've got my head around the above right, the following
function does the above job, and caould be fairly easily
modified for asymmetric confidence intervals (e.g. a 95%
confidence interval with 96% for inclusion below the upper
limit and 99% for inclusion above the lower limit).
q.CI<-function(x,p,P){
# x is the sample, p (0<p<1) the quantile, P the confidence level
x<-sort(x)
n<-length(x)
s <- min(which(pbinom((0:n),n,p) >= 1-(1-P)/2))
r <- max(which(pbinom((0:n),n,p) <= (1-P)/2))
c(x[r],x[s])
# x[r] is the lower limit, x[s] the upper limit, of the CI
}
Note that it gives a confidence level P at least equal to P,
not in general exactly equal to P.
I've tested it as follows (10000 simulations with samples
of size 101 from a Normal distribution -- why not, after
all? No loss of generality!):
p<-0.5; Xq<-qnorm(p); P<-0.95
N<-10000; incls<-numeric(N)
for(i in (1:N)){
x<-rnorm(101)
CI<-q.CI(x,p,P)
if((CI[1]<=Xq)&(CI[2]>=Xq)){ incls[i]<-1 }
}; sum(incls)/N
# [1] 0.9564
# [1] 0.9548
p<-0.75; Xq<-qnorm(p); P<-0.95
[etc. ... ]
# [1] 0.9631
# [1] 0.9596
p<-0.90; Xq<-qnorm(p); P<-0.95
[etc. ... ]
# [1] 0.9540
# [1] 0.9533
p<-0.95; Xq<-qnorm(p); P<-0.95
[etc. ... ]
# [1] 0.9840
# [1] 0.9826
p<-0.99; Xq<-qnorm(p); P<-0.95
[etc. ... ]
Error in if ((CI[1] <= Xq) & (CI[2] >= Xq)) { :
missing value where TRUE/FALSE needed
showing that for extreme values of p relative to the
sample size, trouble occurs (as is to be expected)!
The solution is to test for "NA" in CI[1] and/or CI[2],
so I modified my test routine as follows, to give notional
lower confidence limit -Inf if CI[1] is NA, and/or upper
confidence limit +Inf if CI[2] is NA (of course this
could be done in the function q.CI() itself, but perhaps
it offers more flexibility for the user to do something
else with it, if it is left as NA).
p<-0.99; Xq<-qnorm(p); P<-0.95
N<-10000; incls<-numeric(N)
for(i in (1:N)){
x<-rnorm(101)
CI<-q.CI(x,p,P)
if(is.na(CI[1])){ CI[1]<-(-Inf) }
if(is.na(CI[2])){ CI[2]<-(+Inf) }
if((CI[1]<=Xq)&(CI[2]>=Xq)){ incls[i]<-1 }
}; sum(incls)/N
# [1] 0.9841
# [1] 0.9821
Comments welcome!!!
Ted.
--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 05-Feb-07 Time: 23:40:23
------------------------------ XFMail ------------------------------
More information about the R-help
mailing list