[R] Fast R implementation of Gini mean difference
Andrew C. Ward
andreww at cheque.uq.edu.au
Mon Apr 28 22:02:04 CEST 2003
Thank you again to Professor Azzalini for taking the time to
expound this issue for me. It's been very instructive to look
in detail at the definition of Gini mean difference and at the
incorporation of weights. Noting suggestions for improving the
speed of my function has also been very helpful.
With my application, the weights refer to the "reliability" of
a measurement, with a weight of 1 signifying high reliability
and a weight close to zero indicating low reliability. The
mean difference is used as a robust estimate of the standard
deviation of the vector of measurements (the 0.5*sqrt(pi)
multiplier is for this purpose).
It seems to me that Professor Azzalini's result may be used for
non-integer weights if the weights are scaled to sum to the
number of measurements (ie. w <- w * length(x)/sum(w)). In this
case, I think that gmd(x=c(1,2,4), w=c(1,2,1)) gives the same
result at gmd(x=c(1,2,2,4), w=c(1,1,1,1)).
Thank you again to all who have contributed to my understanding
and R implementation of the mean difference.
Regards,
Andrew C. Ward
CAPE Centre
Department of Chemical Engineering
The University of Queensland
Brisbane Qld 4072 Australia
andreww at cheque.uq.edu.au
On Monday, April 28, 2003 6:55 PM, Adelchi Azzalini
[SMTP:azzalini at stat.unipd.it] wrote:
>
> This is to complement my previous contribution on computation
of
> Gini mean
> difference - a discussion started by Andrew Ward. The index is
> "defined" as
> gini <- 0
> for (i in 1:n)
> {
> for (j in 1:n) gini <- gini + freq[i]*freq[j]*abs(x
> [i]-x[j])
> }
> gini<- gini/((sum(freq)-1)*sum(freq))
>
> This is the so-called form "without repetition"; the variant
> "with repetition"
> does not have -1 in the final line.
>
> Since computaation via the definition is totally inefficient,
> alternative
> approaches have been put forward, following Andrew's message.
>
> My first version of a computationally convenient implementation
> was
> essentially this:
>
> gini.md0<- function(x)
> { # x=data vector
> n <-length(x)
> return(4*sum((1:length(x))*sort(x)/(n*(n-1)))
> -2*mean(x)*(n+1)/(n-1))
> }
>
>
> Since Andrew (private message) has stressed the importance in
> his problem
> of allowing for replicated data, here is a more general
version,
> obtained by
> elaborating on the previous one with a bit of algebra:
>
> gini.md <- function(x, freq=rep(1,length(x)))
> {# x=data vector, freq=vector of frequencies
> if(!is.vector(x)) stop("x must be a vector")
> if(length(x) != length(freq))
> stop("x and freq must have same length")
> if(min(freq)<0 | sum(freq)==0 | any(freq != as.integer(freq))
> )
> stop("freq must be counts")
> x <- x[freq>0]
> freq <- freq[freq>0]
> j <- order(x)
> x <- x[j]
> n <- as.integer(freq[j])
> n. <- sum(n)
> u <- (cumsum(n)-n)*n+ n*(n+1)/2
> return(4*sum(u*x)/(n.*(n.-1))
> -2*weighted.mean(x,n)*(n.+1)/(n.-1))
> }
>
> Notice that gini.md(x,freq) gives the same of mini.md0(rep
> (x,freq)), but the latter
> is obviously less efficient. Either are however far more
> efficient that straight
> implementation of the "definition".
>
> regards
>
> Adelchi Azzalini
>
> --
> Adelchi Azzalini <azzalini at stat.unipd.it>
> Dipart.Scienze Statistiche, Universita di Padova, Italia
> http://azzalini.stat.unipd.it/
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
More information about the R-help
mailing list