[R] Diff btw percentile and quantile

Wed Mar 4 19:55:13 CET 2009

Excel 2003's help for percentile just says it interpolates
between the quantiles in the data:  
   Array   is the array or range of data that defines relative standing.
   K   is the percentile value in the range 0..1, inclusive.
     If array is empty or contains more than 8,191 data points,
        PERCENTILE returns the #NUM! error value. 
     If k is nonnumeric, PERCENTILE returns the #VALUE! error value. 
     If k is < 0 or if k > 1, PERCENTILE returns the #NUM! error value. 
     If k is not a multiple of 1/(n - 1), PERCENTILE interpolates
        to determine the value at the k-th percentile. 
so some experimenation is on order.

I found that the call to R's quantile gives a different result
for each of the 9 documented values of the type argument:
   x<-c(1,1,2,3,3,5,8,8,9,10)
   quantile(x, probs=(0:8)/8, type=types[i])
E.g.,
sapply(1:9,function(type)quantile(x=x,probs=(0:8)/8,type=type))
      type=1 type=2 type=3 type=4 type=5 type=6 type=7    type=8
type=9
0%         1      1      1   1.00   1.00  1.000  1.000  1.000000
1.00000
12.5%      1      1      1   1.00   1.00  1.000  1.125  1.000000
1.00000
25%        2      2      1   1.50   2.00  1.750  2.250  1.916667
1.93750
37.5%      3      3      3   2.75   3.00  3.000  3.000  3.000000
3.00000
50%        3      4      3   3.00   4.00  4.000  4.000  4.000000
4.00000
62.5%      8      8      5   5.75   7.25  7.625  6.875  7.375000
7.34375
75%        8      8      8   8.00   8.00  8.250  8.000  8.083333
8.06250
87.5%      9      9      9   8.75   9.25  9.625  8.875  9.375000
9.34375
100%      10     10     10  10.00  10.00 10.000 10.000 10.000000
10.00000

I entered the same x into Excel 2003 and used the formulae
=percentile(A1:10,0),
=percentile(A1:A10,.125), ..., =percentile(A1:A10,1) and got the results
   1, 1.125, 2.25, 3, 4, 6.875, 8, 8.875, 10
This matches only R's type 7, the default.

They also match S+'s default quantile calculation.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com 

----------------------------
Ted Harding wrote:
> On 04-Mar-09 16:56:14, Wacek Kusnierczyk wrote:
> (Ted Harding) wrote:
> <snip>
>> So, with reference to your original question
>>  "Excel has percentile() function. R function quantile() does the
>>   same thing. Is there any significant difference btw percentile
>>   and quantile?"
>> the answer is that they in effect give the same results, though
>> differ with respect to how they are to be fed (quantile eats
>> probabilities, percentile eats percentages). [Though (since I am
>> not familiar with Excel) I cannot rule out that Excel's percentile()
>> function also eats probabilities; in which case its name would be
>> an example of sloppy nomenclature on Excel's part; which I cannot
>> rule out on general grounds either].
> 
> i am not familiar enough with excel to prove or disprove what you say
> above, but in general such claims should be grounded in the respective
> documentations. 
> 
> there are a number of ways to compute empirical quantiles (see, e.g.,
> [1]), and it's possible that the one used by r's quantile by default
> (see ?quantile) is not the one used by excel (where you probably have
> no choice;  help in oocalc does not specify the method, and i guess
> that excel's does not either).
> 
> have you actually confirmed that excel's percentile() does the same as
> r's quantile() (modulo the scaling)?
> vQ

I have now googled around a bit. All references to the Excel
percentile() function say that you feed it the fractional value
corresponding to the percentage. So, for example, to get the
80-th percentile you would give it 0.8. Hence Excel should call
it "quantile"!

As to the algorithm, Wikipedia states the following (translated
into R syntax):

  Many software packages, such as Microsoft Excel, use the
  following method recommended by NIST[4] to estimate the
  value, vp, of the pth percentile of an ascending ordered
  dataset containing N elements with values v[1],v[2],...,v[N]:

    n = (p/100)*(N-1) + 1

  n is then split into its integer component, k and decimal
  component, d, such that n = k + d.
  If k = 1, then the value for that percentile, vp, is the
  first member of the ordered dataset, v[1].
  If k = N, then the value for that percentile, vp, is the
  Nth member of the ordered dataset, v[N].
  Otherwise, 1 < k < N and vp = v[k] + d*(v[k + 1] - v[k]).

Note that the Wikipedia article uses the "%" interpretation of
"p-th percentile", i.e. the point which is (p/100) of the way
along the distribution.

It looks as though R's quantile with type=4 might be the same,
since it is explained as "linear interpolation of the empirical
cdf", which is what the above description of Excel's method does.
However, R's default type is 7, which is different.

Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <Ted.Harding at manchester.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Mar-09                                       Time: 17:29:50
------------------------------ XFMail ------------------------------