[R] speed of the cluster.stats function

Romain François francoisromain at free.fr
Mon Jan 3 16:27:24 CET 2005


Hello list (happy new yeaR),

Here is a copy of a message I just sent to Christian Hennig (who wrote 
the fpc package).
It may interest some of you, and maybe someone has a better 
solution than mine.

Romain.

------------------------------------------------------------------------------------------

Mister Hennig,

[[[ I'm writing in English because I don't know German and I 
don't know whether you know French ]]]

I'm a student at the "Institut de Statistique de l'Université de Paris" 
and I make heavy use of the fpc package that you wrote for R, especially the 
cluster.stats function. In that function, the calculation of the G2 index 
(Goodman & Kruskal) can be really slow, as you warn in the help page 
for that function. The speed problem is due to the double loop for(i in 
1:nwithin) for(j in 1:nbetween).

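I came up with a solution (probably not the best, but ....) that is 
much faster than yours (with all due respect): 0.12 s instead of 2.87 s 
on the USArrests example (see the time calculation below). All I did was 
vectorize the second loop, so that each within-cluster distance is 
compared with the whole between.dist vector at once. See 
the code in the patch below.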
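
To make "vectorizing the second loop" concrete, here is a minimal sketch on simulated data. The vectors within.dist and between.dist are random stand-ins, and the body of the double loop is only an assumption about its general shape, not a copy of the fpc source.

```r
# Simulated stand-ins for the within- and between-cluster distances
# (hypothetical data, not taken from fpc).
set.seed(1)
within.dist  <- runif(200)
between.dist <- runif(300)
nwithin  <- length(within.dist)
nbetween <- length(between.dist)

# Double loop: one scalar comparison per (within, between) pair.
# Assumed shape of the slow version, for illustration only.
splus.loop <- sminus.loop <- 0
for (i in 1:nwithin) {
  for (j in 1:nbetween) {
    if (within.dist[i] < between.dist[j]) splus.loop  <- splus.loop  + 1
    if (within.dist[i] > between.dist[j]) sminus.loop <- sminus.loop + 1
  }
}

# Inner loop vectorized: compare one within distance with all between
# distances at once and count the TRUE values with sum().
splus.vec <- sminus.vec <- 0
for (i in 1:nwithin) {
  splus.vec  <- splus.vec  + sum(within.dist[i] < between.dist)
  sminus.vec <- sminus.vec + sum(within.dist[i] > between.dist)
}

# Both versions count exactly the same pairs.
stopifnot(splus.loop == splus.vec, sminus.loop == sminus.vec)
```

The number of comparisons is unchanged; the gain comes from doing the inner nbetween comparisons in compiled code rather than in the R interpreter.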

This could be a good thing for the next fpc release.

Cordially.

Romain Francois.



---------------------------- Time calculation ------------------------------------------

> dis <- dist(USArrests) # 50 observations, 4 variables
> hcl <- hclust(dis)
> gro <- cutree(hcl,3)
> system.time(print(cluster.stats(dis,gro,G2=T)$g2))

[1] 0.887726    # the G2 value calculated by your function (just to make sure it is the same)
[1] 2.87 0.00 2.89   NA   NA
    ^^^^
Warning message: non-square matrix in: as.dist(separation)

> system.time(print(R.cluster.stats(dis,gro,G2=T)$g2))

[1] 0.887726     # the G2 from my function (same value as yours)
[1] 0.12 0.00 0.12   NA   NA
    ^^^^
Warning message: non-square matrix in: as.dist(separation)
--------------------------------------------------------------------------------------- 


--------------------------- patch -----------------------------------------------------
...
   if (G2) {
       splus <- sminus <- 0
       for (i in 1:nwithin) {
          # compare one within distance with all between distances at once
          splus  <- splus  + sum(within.dist[i]<between.dist)
          sminus <- sminus + sum(within.dist[i]>between.dist)
       }
       g2 <- (splus - sminus)/(splus + sminus)
   }
...
--------------------------------------------------------------------------------------- 
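
As one possible "better solution", the remaining loop could probably be removed too by sorting between.dist once and counting with findInterval(). This is only a sketch: g2.sorted and g2.loop are hypothetical helper names, not fpc functions, the test data are simulated, and the left.open argument assumes R 3.3.0 or later.

```r
# Hypothetical helper (not part of fpc): G2 without an explicit loop.
g2.sorted <- function(within.dist, between.dist) {
  sb <- sort(between.dist)
  # For each within distance w: number of between distances <= w ...
  n.le <- findInterval(within.dist, sb)
  # ... and number of between distances strictly < w.
  n.lt <- findInterval(within.dist, sb, left.open = TRUE)
  # as.numeric() avoids integer overflow when there are many pairs.
  splus  <- sum(as.numeric(length(sb) - n.le))  # pairs with w < b
  sminus <- sum(as.numeric(n.lt))               # pairs with w > b
  (splus - sminus) / (splus + sminus)
}

# The loop from the patch, wrapped in a function for comparison.
g2.loop <- function(within.dist, between.dist) {
  splus <- sminus <- 0
  for (i in seq_along(within.dist)) {
    splus  <- splus  + sum(within.dist[i] < between.dist)
    sminus <- sminus + sum(within.dist[i] > between.dist)
  }
  (splus - sminus) / (splus + sminus)
}

# Simulated distances, rounded so that ties occur and are exercised.
set.seed(42)
w <- round(runif(100), 1)
b <- round(runif(150) + 0.3, 1)
stopifnot(isTRUE(all.equal(g2.sorted(w, b), g2.loop(w, b))))
```

Ties are counted in neither splus nor sminus, as in the patch. Sorting costs about nbetween*log(nbetween) operations, compared with nwithin*nbetween comparisons for the loop, so the difference should only matter for large data sets; I have not timed it inside cluster.stats.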


-- 
Romain FRANCOIS : francoisromain at free.fr
web page : http://addictedtor.free.fr/  (under construction)
06 18 39 14 69 / 01 46 80 65 60
_______________________________________________________
Third-year student
Institut de Statistique de l'Université de Paris (ISUP)
Industry and Services track
http://www.isup.cicrp.jussieu.fr/



