[R] Fast R implementation of Gini mean difference

Adelchi Azzalini azzalini at stat.unipd.it
Thu Apr 24 15:41:42 CEST 2003


On Thursday 24 April 2003 05:17, you wrote:
> I have written the following function to calculate the weighted mean
> difference for univariate data (see
> http://www.xycoon.com/gini_mean_difference.htm for a related
> formula). Unsurprisingly, the function is slow (compared to sd or mad)
> for long vectors. I wonder if there's a way to make the function
> faster, short of creating an external C function. Thanks very much
> for your advice.
>
>
> gmd <- function(x, w) { # x=data vector, w=weights vector
>    n <- length(x)
>    tmp <- 0
>    for (i in 1:n) {
>       for (j in 1:n) {
>          tmp <- tmp + w[i]*abs(x[i]-x[j])
>       }
>    }
>    retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
> }

I have a few remarks on the "definition" of Gini index, and one on the most 
suitable way to compute it.

(a) there no 0.5*sqrt(pi) neither in the standard definition of the index, nor
   in the quoted source, at http://www.xycoon.com/gini_mean_difference.htm
(b) when the values of x vector have frequencies, then the common
   definition knwon to me is not the above one, but one such that the nested line 
   above should be replaced by
         tmp <- tmp + w[i]*w[j]*abs(x[i]-x[j])
   and the final assignment should be
          0.5*sqrt(pi)*tmp/((sum(w)-1)*sum(w))
  In fact, the above function produces inconsident results; see this:
>  gmd(c(1,2,2,4),c(1,1,1,1))
[1] 1.329
> gmd(c(1,2,4), c(1,2,1))
[1] 1.662

while the above modifications lead again to 
 > gmd(c(1,2,4), c(1,2,1))
[1] 1.329


(c) from the computational viewpoint, there is a much more convenient equivalent 
   expression, based on the order statistics, leading to the following function:

gini.md<- function(x)
  { # x=data vector
    j <-order(x)
    n <-length(x)
    return(4*sum((1:length(x))*x[j])/(n*(n-1))
          -2*mean(x)*(n+1)/(n-1))
  }

This is intendend for the basic case of unreplicated data. With some algebraic work 
one should be able to obtain a similar expression for the general case ...once an
agreement has been achieved on the definition of the index.

Clearly, to convert the output from gini.md() to gmd..() one must multiply by
0.5*sqrt(pi).

Here is a short comparison of timing, following the one prepared by Dan Nordlund.

> x<-rnorm(5000)
>  n<-rpois(5000,5)
> w<-n/sum(n)

>  system.time(gmd(x, w))
[1] 241.44   0.21 250.20   0.00   0.00
>  system.time(gmd.1(x, w))
[1]   5.01   8.69 129.95   0.00   0.00
>  system.time(gmd.2(x, w))
[1]  1.21  1.55 10.19  0.00  0.00
> system.time(gini.md(x))
[1] 0 0 0 0 0

also, the new function is  requires little memory

regards

Adelchi Azzalini

-- 
Adelchi Azzalini  <azzalini at stat.unipd.it>
Dipart.Scienze Statistiche, Università di Padova, Italia
http://azzalini.stat.unipd.it/



More information about the R-help mailing list