[R] Very Slow Gower Similarity Function
Martin Maechler
maechler at stat.math.ethz.ch
Mon Apr 18 21:33:58 CEST 2005
>>>>> "Tyler" == Tyler Smith <tyler.smith at mail.mcgill.ca>
>>>>> on Mon, 18 Apr 2005 12:10:34 -0400 writes:
Tyler> Hello, I am a relatively new user of R. I have
Tyler> written a basic function to calculate the Gower
Tyler> similarity function. I was motivated to do so partly
Tyler> as an exercise in learning R, and partly because the
Tyler> existing option (vegdist in the vegan package) does
Tyler> not accept missing values.
I am not sure exactly what you are after.
The function daisy() in the recommended package "cluster"
has always handled missing values and, IIRC, the book by
Kaufman & Rousseeuw (which I do not have at hand here at home)
clearly cites Gower as the origin of their dissimilarity
definition.
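
For illustration, here is a minimal sketch of the daisy() route. It
assumes a current version of "cluster" in which daisy() accepts
metric = "gower"; the toy data frame x is made up for this example and
is not Tyler's data. daisy() returns dissimilarities, so the
similarity is taken as 1 minus the dissimilarity.

```r
library(cluster)  # recommended package, normally installed with R

# Hypothetical toy data: 4 objects, 3 numeric descriptors, some NAs
x <- data.frame(a = c(1, 2, NA, 4),
                b = c(10, NA, 30, 40),
                c = c(0.5, 1.5, 2.5, 3.5))

# Gower dissimilarities; descriptors missing in either object of a
# pair are left out of that pair's average
d <- daisy(x, metric = "gower")

# Convert the dissimilarity object to a full similarity matrix
s <- 1 - as.matrix(d)
```

For all-numeric data each descriptor contributes
|x_i - x_j| / range, so s should agree with a hand-written Gower
similarity that skips missing values pairwise.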
Martin Maechler, maintainer of cluster package,
ETH Zurich
Tyler> I think I have succeeded - my function gives me the
Tyler> correct values. However, now that I'm starting to use
Tyler> it with real data, I realise it's very slow. It takes
Tyler> more than 45 minutes on my Windows 98 machine (R
Tyler> 2.0.1 Patched (2005-03-29)) with a 185x32 matrix with
Tyler> ca 100 missing values. If anyone can suggest ways to
Tyler> speed up my function I would appreciate it. I suspect
Tyler> having a pair of nested for loops is the problem, but
Tyler> I couldn't figure out how to get rid of them.
Tyler> The function is:
Tyler> ### Gower Similarity Matrix ###
Tyler> sGow <- function (mat){
Tyler>   OBJ <- nrow(mat)       # number of objects
Tyler>   MATDESC <- ncol(mat)   # number of descriptors
Tyler>   MRANGE <- apply(mat, 2, max, na.rm=T) - apply(mat, 2, min, na.rm=T)  # descr ranges
Tyler>   DESCRIPT <- 1:MATDESC  # descriptor index vector
Tyler>   smat <- matrix(1, nrow = OBJ, ncol = OBJ)  # 'empty' similarity matrix
Tyler>
Tyler>   for (i in 1:OBJ){
Tyler>     for (j in i:OBJ){
Tyler>       ## calculate index vector of non-NA descriptors between objects i and j
Tyler>       descvect <- intersect(
Tyler>         setdiff(DESCRIPT, DESCRIPT[is.na(mat[i,DESCRIPT])]),
Tyler>         setdiff(DESCRIPT, DESCRIPT[is.na(mat[j,DESCRIPT])]))
Tyler>       descnum <- length(descvect)  # number of valid descr for i~j comparison
Tyler>       partialsim <- (1 - abs(mat[i,descvect] - mat[j,descvect]) / MRANGE[descvect])
Tyler>       smat[i,j] <- smat[j,i] <- sum(partialsim) / descnum
Tyler>     }
Tyler>   }
Tyler>   smat
Tyler> }
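
On the nested loops: one possible way to drop them is to loop over the
descriptors (columns) instead of over pairs of objects and let outer()
build each n x n matrix of differences in one call. The sketch below
is only an illustration of that idea, not a tested replacement; the
name sGowVec is made up here, and it assumes a numeric matrix in which
every descriptor has a nonzero range.

```r
# Hypothetical vectorized Gower similarity: one pass per descriptor
sGowVec <- function(mat) {
  mat <- as.matrix(mat)
  # Range of each descriptor, ignoring missing values
  rng <- apply(mat, 2, max, na.rm = TRUE) - apply(mat, 2, min, na.rm = TRUE)
  n <- nrow(mat)
  num <- matrix(0, n, n)  # running sum of partial similarities
  den <- matrix(0, n, n)  # count of descriptors usable for each pair
  for (k in seq_len(ncol(mat))) {
    x <- mat[, k]
    # Partial similarity for every pair; NA where either value is missing
    ps <- 1 - abs(outer(x, x, "-")) / rng[k]
    # A zero-range descriptor gives NaN here and is skipped as well,
    # which differs from the looped version above
    ok <- !is.na(ps)
    num[ok] <- num[ok] + ps[ok]
    den <- den + ok
  }
  num / den  # NaN for pairs with no descriptor in common
}
```

This replaces roughly n^2/2 iterations of interpreted set operations
with ncol(mat) matrix operations, so for a 185x32 matrix it runs 32
loop iterations rather than about 17,000.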
Tyler> Thank-you for your time,
Tyler> Tyler
Tyler> -- Tyler Smith
Tyler> PhD Candidate Plant Science Department McGill
Tyler> University
Tyler> tyler.smith at mail.mcgill.ca