[R] speed of the cluster.stats function
Romain François
francoisromain at free.fr
Mon Jan 3 16:27:24 CET 2005
Hello list (happy new yeaR),
Here's a copy of a message I just sent to Christian Hennig (who wrote
the fpc package).
It may interest some of you, and maybe someone has a better
solution than mine.
Romain.
------------------------------------------------------------------------------------------
Dear Mr. Hennig,
[[[ I'm writing in English because I don't know German and I
don't know whether you know French ]]]
I'm a student at the "Institut de Statistique de l'Université de Paris"
and I use the fpc library that you built for R a lot, especially the
cluster.stats function. In that function, the calculation of the G2 index
(Goodman & Kruskal) can be really slow, as you warn in the help page
for that function. The speed problem is due to the double loop for(i in
1:nwithin) for(j in 1:nbetween).
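For illustration, a double loop of that shape might look like the sketch
below. This is a reconstruction from the loop header quoted above, not code
copied from fpc: the function name g2_double_loop and the toy vectors w and b
are made up for the example, and only within.dist and between.dist follow the
names used in the patch.

```r
# Hypothetical reconstruction of the double-loop shape; not fpc source code.
g2_double_loop <- function(within.dist, between.dist) {
  splus <- sminus <- 0
  for (i in seq_along(within.dist)) {
    for (j in seq_along(between.dist)) {
      # within-cluster distance smaller than between-cluster distance
      if (within.dist[i] < between.dist[j]) splus <- splus + 1
      # within-cluster distance larger than between-cluster distance
      if (within.dist[i] > between.dist[j]) sminus <- sminus + 1
    }
  }
  (splus - sminus) / (splus + sminus)
}

# Toy data, invented for illustration only
w <- c(1, 2, 5)   # within-cluster distances
b <- c(3, 4, 6)   # between-cluster distances
g2_toy <- g2_double_loop(w, b)
print(g2_toy)     # 7 smaller pairs, 2 larger pairs: (7 - 2) / 9
```

Every one of the nwithin * nbetween comparisons is a separate interpreted
step here, which is why this form gets slow as the data grow.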
I came up with a solution (probably not the best, but ....) that is
much faster than yours (with all due respect). (You can see the timings
below.) All I did was vectorize the second loop. See
the code in the patch below.
This could be a useful change for the next fpc release.
Cordially.
Romain Francois.
---------------------------- Time calculation ------------------------------------------
> dis <- dist(USArrests) # 50 observations, 4 variables
> hcl <- hclust(dis)
> gro <- cutree(hcl,3)
> system.time(print(cluster.stats(dis,gro,G2=T)$g2))
[1] 0.887726 # the G2 value calculated by your function (just to make sure that's the same)
[1] 2.87 0.00 2.89 NA NA
^^^^
Warning message: non-square matrix in: as.dist(separation)
> system.time(print(R.cluster.stats(dis,gro,G2=T)$g2))
[1] 0.887726 # the G2 of my function (same value as yours)
[1] 0.12 0.00 0.12 NA NA
^^^^
Warning message: non-square matrix in: as.dist(separation)
---------------------------------------------------------------------------------------
--------------------------- patch -----------------------------------------------------
...
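if (G2) {
  # splus: pairs where a within-cluster distance is smaller than a
  # between-cluster distance; sminus: pairs where it is larger
  splus <- sminus <- 0
  for (i in 1:nwithin) {
    # compare one within-cluster distance against all between-cluster ones
    splus <- splus + sum(within.dist[i] < between.dist)
    sminus <- sminus + sum(within.dist[i] > between.dist)
  }
  g2 <- (splus - sminus)/(splus + sminus)
}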
...
---------------------------------------------------------------------------------------
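The patch can be tried outside cluster.stats with the rough sketch below,
which uses base R only. The way within.dist and between.dist are built here
(from the lower triangle of the distance matrix) is an assumption made for the
example and may differ from what fpc does internally. The names same, lower,
g2_loop and g2_outer are also invented for the sketch. It also shows a fully
vectorized variant based on outer(), as one possible alternative.

```r
# Base R only; USArrests ships with the datasets package.
dis <- dist(USArrests)
gro <- cutree(hclust(dis), 3)

# Assumed construction of the two distance vectors (may differ from fpc):
d     <- as.matrix(dis)
same  <- outer(gro, gro, "==")   # TRUE where both points share a cluster
lower <- lower.tri(d)            # keep each pair of points once
within.dist  <- d[lower & same]
between.dist <- d[lower & !same]

# Same shape as the patch: loop over within-cluster distances,
# vectorized over between-cluster distances.
splus <- sminus <- 0
for (i in seq_along(within.dist)) {
  splus  <- splus  + sum(within.dist[i] < between.dist)
  sminus <- sminus + sum(within.dist[i] > between.dist)
}
g2_loop <- (splus - sminus) / (splus + sminus)

# Fully vectorized alternative: builds an nwithin x nbetween logical
# matrix, so memory use grows with nwithin * nbetween.
splus2  <- sum(outer(within.dist, between.dist, "<"))
sminus2 <- sum(outer(within.dist, between.dist, ">"))
g2_outer <- (splus2 - sminus2) / (splus2 + sminus2)

print(all.equal(g2_loop, g2_outer))
```

The outer() version removes the remaining loop but trades time for memory, so
the single-loop form of the patch is probably the safer choice for large
data sets.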
--
Romain FRANCOIS : francoisromain at free.fr
web page: http://addictedtor.free.fr/ (under construction)
06 18 39 14 69 / 01 46 80 65 60
_______________________________________________________
Third-year student
Institut de Statistique de l'Université de Paris (ISUP)
Industry and Services track
http://www.isup.cicrp.jussieu.fr/