[R] severe bug in LICORS/kmeanspp
iritzke m@iii@g oii web@de
Wed Nov 3 23:04:06 CET 2021
Hello,
I found a severe bug in the function kmeanspp (k-means++ implementation) in the LICORS package (https://www.rdocumentation.org/packages/LICORS/versions/0.2.0/topics/kmeanspp) and need some advice on how to handle this (hence my posting here).
Since LICORS has not been updated since 2013, I am not sure whether it still has an active maintainer. The named maintainer has worked for Google since 2014 and may have abandoned the package.
On the other hand, this is one of the few implementations of k-means++ in R, and it comes up first when searching for one.
The bug leads to inferior results. In its current form, the results were much worse than those of Hartigan-Wong (the default algorithm of stats::kmeans) for all test problems I tried. After fixing the bug, however, kmeanspp found better results than Hartigan-Wong on all of those problems. So anyone comparing the two algorithms based on the current R implementations may have come to the wrong conclusions. BTW: the Hartigan-Wong implementation (Fortran) is *much* faster than LICORS::kmeanspp, which is written in pure R (apart from one call to stats::kmeans at the end), but that is not the point here.
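For anyone who wants to reproduce such a comparison without installing LICORS, here is a minimal self-contained sketch of k-means++ seeding in base R (the function name kmeanspp_sketch is mine, not the LICORS API; it is an illustration, not the package code). It picks seeds with probability proportional to the squared distance to the nearest seed chosen so far and then hands them to stats::kmeans:

```r
# Illustrative k-means++ seeding sketch (assumes k >= 2 and a numeric matrix).
kmeanspp_sketch <- function(data, k) {
  n <- nrow(data)
  center_ids <- sample.int(n, 1)       # first seed: uniform at random
  for (i in 2:k) {
    # squared distances from every point to every current seed
    dists <- apply(data[center_ids, , drop = FALSE], 1, function(center) {
      rowSums((data - rep(center, each = n))^2)
    })
    d2 <- apply(cbind(dists), 1, min)  # distance to the nearest seed
    # next seed: drawn with probability proportional to d2
    center_ids <- c(center_ids, sample.int(n, 1, prob = d2))
  }
  kmeans(data, centers = data[center_ids, , drop = FALSE])
}

set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit <- kmeanspp_sketch(X, 2)
fit$tot.withinss
```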
The bug concerns a distance computation that should produce a matrix of squared distances between all data vectors and all current codebook vectors, but does not. The code and an example illustrating the problem are shown below. Basically, to subtract a vector row-wise from a matrix in R, one has to expand the vector into a matrix whose rows are all copies of that vector; otherwise R recycles the vector column-wise. Details are shown below. The fix is trivial.
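The column-wise recycling can be demonstrated in two lines, using the same data layout as the example further below (six 2-d points, one codebook vector):

```r
data   <- cbind(1:6, rep(0, 6))  # six 2-d points, one per row
center <- c(4, 0)                # one codebook vector

# Buggy pattern: center is recycled column-major down the matrix,
# so 4, 0, 4, 0, ... alternate within the first column.
data - center

# Fixed pattern: rep(center, each = nrow(data)) lays the vector out
# so that every row of data has center subtracted from it.
data - rep(center, each = nrow(data))
```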
I stumbled upon this because kmeanspp produced counterintuitive and very poor results.
Is this of any interest? Should kmeanspp in LICORS be fixed? I have neither the experience nor the time to take over an R package. I could write a more formal bug report if required, but first I would like to know whether this is considered relevant.
Suggestions/comments are welcome.
Kind regards,
Bernd Fritzke
The code:

```
if (ndim == 1) {
    dists <- apply(cbind(data[center_ids, ]), 1, function(center) {
        rowSums((data - center)^2)
    })
} else {
    dists <- apply(data[center_ids, ], 1, function(center) {
        rowSums((data - center)^2)
    })
}
```
should rather be (the two changed lines are marked "# fixed"):

```
if (ndim == 1) {
    dists <- apply(cbind(data[center_ids, ]), 1, function(center) {
        rowSums((data - rep(center, each = nrow(data)))^2)  # fixed
    })
} else {
    dists <- apply(data[center_ids, ], 1, function(center) {
        rowSums((data - rep(center, each = nrow(data)))^2)  # fixed
    })
}
```
Here is some example code illustrating the problem. It should compute the squared distances between the six two-dimensional vectors in "data" and three codebook vectors, which happen to be elements 1, 2, and 4 of "data".
```{r}
data <- cbind(1:6, rep(0, 6))
print("data")
print(data)
center_ids <- c(1, 2, 4)
print("codebook")
print(data[center_ids, ])

# fixed
dists <- apply(data[center_ids, ], 1, function(center) {
    rowSums((data - rep(center, each = nrow(data)))^2)
})
print("dists (correct)")
print(dists)

# buggy
dists <- apply(data[center_ids, ], 1, function(center) {
    rowSums((data - center)^2)
})
print("buggy dists from LICORS")
print(dists)
```
Output:
[1] "data"
[,1] [,2]
[1,] 1 0
[2,] 2 0
[3,] 3 0
[4,] 4 0
[5,] 5 0
[6,] 6 0
[1] "codebook"
[,1] [,2]
[1,] 1 0
[2,] 2 0
[3,] 4 0
[1] "dists (correct)"
[,1] [,2] [,3]
[1,] 0 1 9
[2,] 1 0 4
[3,] 4 1 1
[4,] 9 4 0
[5,] 16 9 1
[6,] 25 16 4
[1] "buggy dists from LICORS"
[,1] [,2] [,3]
[1,] 1 5 25
[2,] 4 4 4
[3,] 5 5 17
[4,] 16 16 16
[5,] 17 13 17
[6,] 36 36 36
--
Dr.-Ing. Bernd Fritzke