[R] Fast R implementation of Gini mean difference
Adelchi Azzalini
azzalini at stat.unipd.it
Thu Apr 24 15:41:42 CEST 2003
On Thursday 24 April 2003 05:17, you wrote:
> I have written the following function to calculate the weighted mean
> difference for univariate data (see
> http://www.xycoon.com/gini_mean_difference.htm for a related
> formula). Unsurprisingly, the function is slow (compared to sd or mad)
> for long vectors. I wonder if there's a way to make the function
> faster, short of creating an external C function. Thanks very much
> for your advice.
>
>
> gmd <- function(x, w) { # x=data vector, w=weights vector
> n <- length(x)
> tmp <- 0
> for (i in 1:n) {
> for (j in 1:n) {
> tmp <- tmp + w[i]*abs(x[i]-x[j])
> }
> }
> retval <- 0.5*sqrt(pi)*tmp/((n-1)*sum(w))
> }
I have a few remarks on the "definition" of Gini index, and one on the most
suitable way to compute it.
(a) there no 0.5*sqrt(pi) neither in the standard definition of the index, nor
in the quoted source, at http://www.xycoon.com/gini_mean_difference.htm
(b) when the values of x vector have frequencies, then the common
definition knwon to me is not the above one, but one such that the nested line
above should be replaced by
tmp <- tmp + w[i]*w[j]*abs(x[i]-x[j])
and the final assignment should be
0.5*sqrt(pi)*tmp/((sum(w)-1)*sum(w))
In fact, the above function produces inconsident results; see this:
> gmd(c(1,2,2,4),c(1,1,1,1))
[1] 1.329
> gmd(c(1,2,4), c(1,2,1))
[1] 1.662
while the above modifications lead again to
> gmd(c(1,2,4), c(1,2,1))
[1] 1.329
(c) from the computational viewpoint, there is a much more convenient equivalent
expression, based on the order statistics, leading to the following function:
gini.md<- function(x)
{ # x=data vector
j <-order(x)
n <-length(x)
return(4*sum((1:length(x))*x[j])/(n*(n-1))
-2*mean(x)*(n+1)/(n-1))
}
This is intendend for the basic case of unreplicated data. With some algebraic work
one should be able to obtain a similar expression for the general case ...once an
agreement has been achieved on the definition of the index.
Clearly, to convert the output from gini.md() to gmd..() one must multiply by
0.5*sqrt(pi).
Here is a short comparison of timing, following the one prepared by Dan Nordlund.
> x<-rnorm(5000)
> n<-rpois(5000,5)
> w<-n/sum(n)
> system.time(gmd(x, w))
[1] 241.44 0.21 250.20 0.00 0.00
> system.time(gmd.1(x, w))
[1] 5.01 8.69 129.95 0.00 0.00
> system.time(gmd.2(x, w))
[1] 1.21 1.55 10.19 0.00 0.00
> system.time(gini.md(x))
[1] 0 0 0 0 0
also, the new function is requires little memory
regards
Adelchi Azzalini
--
Adelchi Azzalini <azzalini at stat.unipd.it>
Dipart.Scienze Statistiche, Università di Padova, Italia
http://azzalini.stat.unipd.it/
More information about the R-help
mailing list