[R] Very Slow Gower Similarity Function

Mon Apr 18 21:33:58 CEST 2005

>>>>> "Tyler" == Tyler Smith <tyler.smith at mail.mcgill.ca>
>>>>>     on Mon, 18 Apr 2005 12:10:34 -0400 writes:

    Tyler> Hello, I am a relatively new user of R. I have
    Tyler> written a basic function to calculate the Gower
    Tyler> similarity function. I was motivated to do so partly
    Tyler> as an exercise in learning R, and partly because the
    Tyler> existing option (vegdist in the vegan package) does
    Tyler> not accept missing values.

I don't know what exactly you want.

The function daisy() in the recommended package "cluster"
has always worked with missing values and, IIRC, the book by
Kaufman & Rousseeuw {which I do not have at hand here at home}
clearly credits Gower as the origin of their distance measure
definition.
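
A minimal sketch of that suggestion, assuming a version of "cluster"
whose daisy() accepts metric = "gower" and using a small made-up data
frame (the names x, d and s are illustrative only). daisy() returns
dissimilarities, so a similarity is obtained as 1 - d:

```r
library(cluster)  # recommended package, normally installed alongside R

# Hypothetical toy data: 3 objects, 2 numeric descriptors, one missing value
x <- data.frame(a = c(1, 3, 5), b = c(10, NA, 30))

# Gower dissimilarities; for each pair of objects only the descriptors
# observed in both objects contribute
d <- daisy(x, metric = "gower")

# Convert the dissimilarity object to a full similarity matrix
s <- 1 - as.matrix(d)
```

For purely numeric descriptors this should agree with a range-scaled
similarity of the kind computed by sGow() below.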

Martin Maechler, maintainer of cluster package,
ETH Zurich

    Tyler> I think I have succeeded - my function gives me the
    Tyler> correct values. However, now that I'm starting to use
    Tyler> it with real data, I realise it's very slow. It takes
    Tyler> more than 45 minutes on my Windows 98 machine (R
    Tyler> 2.0.1 Patched (2005-03-29)) with a 185x32 matrix with
    Tyler> ca 100 missing values. If anyone can suggest ways to
    Tyler> speed up my function I would appreciate it. I suspect
    Tyler> having a pair of nested for loops is the problem, but
    Tyler> I couldn't figure out how to get rid of them.

    Tyler> The function is:

    Tyler> ### Gower Similarity Matrix###

    Tyler> sGow <- function (mat){

    Tyler>   OBJ <- nrow(mat)       # number of objects
    Tyler>   MATDESC <- ncol(mat)   # number of descriptors
    Tyler>   # descriptor ranges
    Tyler>   MRANGE <- apply(mat, 2, max, na.rm=T) - apply(mat, 2, min, na.rm=T)
    Tyler>   DESCRIPT <- 1:MATDESC  # descriptor index vector
    Tyler>   smat <- matrix(1, nrow = OBJ, ncol = OBJ)  # 'empty' similarity matrix

    Tyler>   for (i in 1:OBJ){
    Tyler>     for (j in i:OBJ){

    Tyler>       ## calculate index vector of non-NA descriptors
    Tyler>       ## between objects i and j
    Tyler>       descvect <- intersect(
    Tyler>         setdiff(DESCRIPT, DESCRIPT[is.na(mat[i,DESCRIPT])]),
    Tyler>         setdiff(DESCRIPT, DESCRIPT[is.na(mat[j,DESCRIPT])]))

    Tyler>       # number of valid descr for i~j comparison
    Tyler>       descnum <- length(descvect)

    Tyler>       partialsim <- (1 -
    Tyler>         abs(mat[i,descvect] - mat[j,descvect]) / MRANGE[descvect])

    Tyler>       smat[i,j] <- smat[j,i] <- sum(partialsim) / descnum
    Tyler>     }
    Tyler>   }
    Tyler>   smat
    Tyler> }
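
One possible way to avoid the nested loops over object pairs, sketched
under the assumption that all descriptors are numeric: loop over the
descriptors (columns) instead and let outer() handle every pair of
objects at once. The names sGowFast, num and den are hypothetical and
not part of any package. Note one deliberate difference from sGow():
a descriptor with zero range is skipped here rather than propagating
NaN into the result.

```r
# Sketch: Gower similarity for a numeric matrix, vectorised over object pairs
sGowFast <- function(mat) {
  mat <- as.matrix(mat)
  n <- nrow(mat)
  num <- matrix(0, n, n)  # running sum of partial similarities per pair
  den <- matrix(0, n, n)  # number of descriptors valid for each pair

  for (k in seq_len(ncol(mat))) {
    x <- mat[, k]
    rng <- diff(range(x, na.rm = TRUE))     # descriptor range (max - min)
    s <- 1 - abs(outer(x, x, "-")) / rng    # NA where either object is missing
    ok <- !is.na(s)                         # pairs usable for this descriptor
    num[ok] <- num[ok] + s[ok]
    den <- den + ok
  }

  num / den  # NaN for pairs with no descriptor in common, as in sGow()
}
```

With 32 descriptors this runs 32 loop iterations, each operating on a
whole 185 x 185 matrix, instead of evaluating roughly 17,000 pairs one
at a time in interpreted code.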

    Tyler> Thank you for your time,

    Tyler> Tyler

    Tyler> -- Tyler Smith

    Tyler> PhD Candidate Plant Science Department McGill
    Tyler> University

    Tyler> tyler.smith at mail.mcgill.ca
