[R] R function for percentrank

Wed Dec 5 18:42:40 CET 2007

I'm coming late to this, but this *does* need a correction
just for the archives !

>>>>> "MS" == Marc Schwartz <marc_schwartz at comcast.net>
>>>>>     on Sat, 01 Dec 2007 13:33:21 -0600 writes:

    MS> On Sat, 2007-12-01 at 18:40 +0000, David Winsemius wrote:
    >> David Winsemius <dwinsemius at comcast.net> wrote in 
    >> news:Xns99F989B3A3057dNOTwinscomcast at 80.91.229.13:
    >> 
    >> > "tom soyer" <tom.soyer at gmail.com> wrote in
    >> > news:65cc7bdf0712010951p451a993i70da89f285d801de at mail.gmail.com: 
    >> > 
    >> >> John,
    >> >> 
    >> >> Excel's percentrank function works like this: if one has a number,
    >> >> x for example, and one wants to know the percentile of this number in
    >> >> a given data set, dataset, one would type =percentrank(dataset,x) in
    >> >> Excel to calculate the percentile. So for example, if the data set is
    >> >> c(1:10), and one wants to know the percentile of 2.5 in the data set,
    >> >> then using the percentrank function one would get 0.166, i.e., 2.5 is
    >> >> in the 16.6th percentile. 
    >> >> 
    >> >> I am not sure how to program this function in R. I couldn't find it as
    >> >> a built-in function in R either. It seems to be an obvious choice for
    >> >> a built-in function. I am very surprised, but maybe we both missed it.
    >> >  
    >> > My nomination for a function with a similar result would be ecdf(), the 
    >> > empirical cumulative distribution function. It is of class "function", so
    >> > efforts to index ecdf(.)[.] failed for me.

I think you did not understand ecdf() !!!
It *returns* a function,
that you can then apply to old (or new) data; see below
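
To make the "returns a function" point concrete, here is a minimal sketch.
The name Fn is only an illustrative choice, and the sample is the same
set.seed(1) / rnorm(25) data used elsewhere in this thread:

```r
set.seed(1)
x <- rnorm(25)

Fn <- ecdf(x)          # Fn is itself a function, not a vector of values

is.function(Fn)        # TRUE: so you call it, you do not index it
inherits(Fn, "ecdf")   # TRUE

Fn(0)                  # proportion of the sample that is <= 0
Fn(c(-10, 10))         # far below / far above the sample: 0 and 1
```

Calling Fn(.) is the intended interface; Fn[.] fails because a function is
not subsettable.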

    MS> You can use ls.str() to look into the function environment:

    >> ls.str(environment(ecdf(x)))
    MS> f :  num 0
    MS> method :  int 2
    MS> n :  int 25
    MS> x :  num [1:25] -2.215 -1.989 -0.836 -0.820 -0.626 ...
    MS> y :  num [1:25] 0.04 0.08 0.12 0.16 0.2 0.24 0.28 0.32 0.36 0.4 ...
    MS> yleft :  num 0
    MS> yright :  num 1

    MS> You can then use get() or mget() within the function environment to
    MS> return the requisite values. Something along the lines of the following
    MS> within the function percentrank():

    MS> percentrank <- function(x, val)
    MS> {
    MS> env.x <- environment(ecdf(x))
    MS> res <- mget(c("x", "y"), env.x)
    MS> Ind <- which(sapply(seq(length(res$x)),
    MS> function(i) isTRUE(all.equal(res$x[i], val))))
    MS> res$y[Ind]
    MS> }

sorry Marc, but "Yuck !!"

- this percentrank() only works when 'val' is one of the original x[i] values
- it only works for 'val' of length 1
- it is a complicated hack

and absolutely unneeded  (see below)
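
As a small sketch of why none of these restrictions apply to ecdf(): the
function it returns accepts a whole vector, and the values need not occur in
the sample. The names vals, p and by_hand are illustrative only; by_hand just
spells out the definition (the proportion of observations <= v):

```r
set.seed(1)
x <- rnorm(25)

vals <- c(-3, -0.5, 0, 0.5, 3)   # none of these needs to be an element of x

p <- ecdf(x)(vals)               # one call, vectorised over 'vals'

# The same quantity computed directly from the definition
by_hand <- vapply(vals, function(v) mean(x <= v), numeric(1))

all.equal(p, by_hand)            # TRUE
```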

    MS> Thus:

    MS> set.seed(1)
    MS> x <- rnorm(25)

    >> x
    MS> [1] -0.62645381  0.18364332 -0.83562861  1.59528080  0.32950777
    MS> [6] -0.82046838  0.48742905  0.73832471  0.57578135 -0.30538839
    MS> [11]  1.51178117  0.38984324 -0.62124058 -2.21469989  1.12493092
    MS> [16] -0.04493361 -0.01619026  0.94383621  0.82122120  0.59390132
    MS> [21]  0.91897737  0.78213630  0.07456498 -1.98935170  0.61982575

    >> percentrank(x, 0.48742905)
    MS> [1] 0.56

[gives 0.52 in my version of R ]

Well, that is *THE SAME*  as using  ecdf() the way you 
should have used it :

  ecdf(x)(0.48742905)

{in two lines, that is

  mypercR <- ecdf(x)
  mypercR(0.48742905)

 which may be easier to understand, if you have never used the
 nice concept that underlies all of

 approxfun(), splinefun() or ecdf()
}
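
For readers who have not met that concept (a function that builds and returns
another function), a rough sketch follows. approxfun() is the real stats
function; make_scaler is a made-up name, added only to show the same pattern
written by hand:

```r
# approxfun() returns a function that interpolates linearly between the
# points it was given
xs <- c(0, 1, 2)
ys <- c(0, 10, 20)
lin <- approxfun(xs, ys)
lin(c(0.5, 1.5))     # 5 15

# The same idea written by hand (hypothetical helper): the returned
# function remembers 'k' from the call that created it
make_scaler <- function(k) function(v) k * v
double_it <- make_scaler(2)
double_it(21)        # 42
```

ecdf(x) works the same way: the returned function carries the sorted data
with it, so it can be evaluated later at any value.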

You can also use

  ecdf(x)(x)

and indeed check that it is identical to the convoluted
percentrank() function above :

> ecdf(x)(0.48742905)
[1] 0.52
> ecdf(x)(x)
 [1] 0.20 0.44 0.12 1.00 0.48 0.16 0.56 0.72 0.60 0.28 0.96 0.52 0.24 0.04 0.92
[16] 0.32 0.36 0.88 0.80 0.64 0.84 0.76 0.40 0.08 0.68
> all(ecdf(x)(x) == sapply(x, function(v) percentrank(x,v)))
[1] TRUE
> 
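
A further cross-check that needs no helper function at all, offered as a
sketch: at the sample points, the ECDF is the number of observations <= x[i]
divided by n, which is what rank() gives with ties.method = "max". The vector
y below is invented here just to show that ties behave the same way:

```r
set.seed(1)
x <- rnorm(25)

# ECDF at the sample points equals the "max" rank divided by the sample size
all.equal(ecdf(x)(x), rank(x, ties.method = "max") / length(x))   # TRUE

# Same identity with tied values
y <- c(1, 2, 2, 3)
ecdf(y)(y)                                  # 0.25 0.75 0.75 1.00
rank(y, ties.method = "max") / length(y)    # 0.25 0.75 0.75 1.00
```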

Regards (and apologies for my apparent indignation ;-)
by the author of ecdf() ,

Martin Maechler, ETH Zurich  

    MS> One other approach, which returns the values and their respective rank
    MS> percentiles is:

     >> cumsum(prop.table(table(x)))
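
As a sketch of how this table-based approach relates to ecdf(): the
cumulative proportions it returns should agree with the ECDF evaluated at the
sorted unique data values. The names tab and ux are illustrative only:

```r
set.seed(1)
x <- rnorm(25)

tab <- cumsum(prop.table(table(x)))   # named vector, one entry per distinct value
ux  <- sort(unique(x))                # the distinct values, in increasing order

# Drop the names before comparing; the numbers agree with the ECDF
all.equal(as.numeric(tab), ecdf(x)(ux))   # TRUE
```

Unlike ecdf(x), the table only covers values that occur in the data, so it
cannot be evaluated at new values.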

    [...... snip ........]