[R] Empty cluster / segfault using vanilla kmeans with version 2.15.2

Tue Mar 12 15:11:37 CET 2013

This example dataset breaks kmeans() in R version 2.15.2, installed from
the Belgian CRAN mirror, on Ubuntu 12.04 LTS 64-bit.

> my.sample.2
      Day1 Day2 Day3 Day4 Day5 Day6
 [1,]    4    5    5    3    5    5
 [2,]    7    7    6    5    6    6
 [3,]    6    6    5    5    5    5
 [4,]    5    3    4    3    2    4
 [5,]    4    3    2    5    3    2
 [6,]    6    6    6    5    6    6
 [7,]    6    7    6    6    7    6
 [8,]    4    3    5    4    5    5
 [9,]    3    5    5    5    5    6
[10,]    4    5    3    2    4    4
[11,]    7    7    7    5    7    7
[12,]    3    4    2    2    2    2
[13,]    4    6    6    4    6    6
[14,]    5    6    5    6    6    6
[15,]    4    5    5    5    4    3
[16,]    5    6    6    6    6    6
[17,]    7    7    7    6    7    6
[18,]    3    2    3    3    4    2
[19,]    6    5    5    4    5    4
[20,]    5    4    1    5    1    3
[21,]    4    5    5    4    6    5
[22,]    3    4    6    5    6    3
[23,]    2    3    2    3    3    3
[24,]    5    6    5    3    4    5
[25,]    6    6    6    6    6    6
[26,]    5    4    5    5    5    5
[27,]    5    6    6    1    3    6
[28,]    4    4    4    3    3    5
[29,]    6    7    5    5    4    6
[30,]    3    2    2    2    3    2
[31,]    2    4    1    6    4    3
[32,]    4    6    4    5    4    5
[33,]    3    2    2    3    3    3
[34,]    2    3    6    5    4    4
[35,]    2    2    1    1    1    2
[36,]    2    3    2    3    2    3
[37,]    3    6    5    5    3    5
[38,]    7    3    3    7    3    5
[39,]    2    2    4    4    2    4
[40,]    2    4    3    2    3    2

## Define a variable
> hm.clusters <- 5

## Running kmeans with 100 random starts, several times; 7 times in a row I
##  get the 'empty cluster' error
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)

## The next attempt provokes the segmentation fault. Please note that there is
##  nothing special about the 7 failures reported above; next time the
##  segfault can happen on the very first attempt
>  k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)

 *** caught segfault ***
address 0x10, cause 'memory not mapped'
Segmentation fault (core dumped)

That's about it. The attached file was written with write.table(x,
file=...)
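
For anyone who wants to load the attachment, here is a minimal sketch of the round trip. It assumes the default write.table() settings, which the attached file appears to use (quoted header and row names, space separated). The path is a temporary placeholder, not the real attachment name, and only the first three rows of the dataset are used for illustration.

```r
# First three rows of the dataset, only to illustrate the round trip.
x <- matrix(c(4, 5, 5, 3, 5, 5,
              7, 7, 6, 5, 6, 6,
              6, 6, 5, 5, 5, 5),
            ncol = 6, byrow = TRUE,
            dimnames = list(NULL, paste("Day", 1:6, sep = "")))

# Placeholder path (an assumption); substitute the saved attachment.
f <- tempfile(fileext = ".txt")
write.table(x, file = f)

# The header line has one field fewer than the data lines, so read.table()
# takes the first line as the header and the first column as row names.
my.sample.2 <- as.matrix(read.table(f))

# Drop the row names so the result matches the original matrix layout.
dimnames(my.sample.2) <- list(NULL, colnames(my.sample.2))
stopifnot(all(my.sample.2 == x))
```

With the real attachment, only the read.table() call and the as.matrix() conversion should be needed.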

I clustered the same dataset with R 2.14.1, same computer, same OS, using
nstart=1000, and I did it 1000 times. I never had the slightest problem.
Moreover, at the cost of repeating myself, the 'empty cluster' error is
plausibly the symptom of a bug, because it _should_ never happen with the
Hartigan-Wong algorithm (the default for kmeans()).
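
To count the failures instead of tallying them by hand, something along these lines could be used. It is only a sketch: the helper name count.failures is made up for illustration, and the data are the first 20 rows of the dataset above, used to keep the example short (pass the full my.sample.2 for the real test). The algorithm is named explicitly so that the default is not in doubt. Note that tryCatch() only intercepts ordinary R errors; a segfault would still kill the session.

```r
# First 20 rows of my.sample.2 (a subset, used only to keep this short).
my.sub <- matrix(c(4, 5, 5, 3, 5, 5,
                   7, 7, 6, 5, 6, 6,
                   6, 6, 5, 5, 5, 5,
                   5, 3, 4, 3, 2, 4,
                   4, 3, 2, 5, 3, 2,
                   6, 6, 6, 5, 6, 6,
                   6, 7, 6, 6, 7, 6,
                   4, 3, 5, 4, 5, 5,
                   3, 5, 5, 5, 5, 6,
                   4, 5, 3, 2, 4, 4,
                   7, 7, 7, 5, 7, 7,
                   3, 4, 2, 2, 2, 2,
                   4, 6, 6, 4, 6, 6,
                   5, 6, 5, 6, 6, 6,
                   4, 5, 5, 5, 4, 3,
                   5, 6, 6, 6, 6, 6,
                   7, 7, 7, 6, 7, 6,
                   3, 2, 3, 3, 4, 2,
                   6, 5, 5, 4, 5, 4,
                   5, 4, 1, 5, 1, 3),
                 ncol = 6, byrow = TRUE)

# Hypothetical helper: run kmeans() 'reps' times and count the runs that
# stop with an R error (e.g. 'empty cluster'). Segfaults are not caught.
count.failures <- function(x, k, reps = 20, nstart = 100, iter.max = 50) {
  failed <- 0
  for (i in seq_len(reps)) {
    fit <- tryCatch(
      kmeans(x, k, nstart = nstart, iter.max = iter.max,
             algorithm = "Hartigan-Wong"),
      error = function(e) NULL)
    if (is.null(fit)) {
      failed <- failed + 1
    } else {
      # A successful run should return exactly k non-empty clusters.
      stopifnot(length(fit$size) == k, all(fit$size > 0))
    }
  }
  failed
}

set.seed(1)
n.failed <- count.failures(my.sub, 5)
cat(n.failed, "of 20 runs failed\n")
```

Running the same call under 2.14.1 and 2.15.2 would give two counts that can be compared directly.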

Kind regards,
and thanks again for your time.

Luca Nanetti

On Sat, Feb 9, 2013 at 8:52 PM, Uwe Ligges
<ligges at statistik.tu-dortmund.de> wrote:

> We need a reproducible example.
>
> Uwe Ligges
>
>
>
> On 03.02.2013 15:03, Luca Nanetti wrote:
>
>> Dear experts,
>> I am encountering a version-dependent issue.
>>
>> My laptop runs Ubuntu 12.04 LTS 64-bit, R 2.14.1; the issue explained
>> below never occurred with this version of R.
>> My desktop runs Ubuntu 11.10 64-bit, R 2.13.2; what follows applies to
>> this setup.
>>
>> The data I'm clustering consists of the rows of a 320 x 6 matrix
>> containing integers ranging from 1 to 7, no missing data.
>> I applied kmeans() to this matrix, literally, 256 x 10^6 times using R
>> version 2.13.2 or 2.14.1, without ever experiencing the slightest
>> problem.
>> My usual setup is with k=5, nstart=256, iter.max=50.
>>
>> Upgrading to R 2.15.2, I experienced either a warning message ('Empty
>> cluster. Choose a better set of initial centers') or a catastrophic
>> segfault. The only way I can get any solution at all is to set nstart to
>> its default value, i.e. 1. However, when simply repeating the clustering,
>> the same issue still happens. Moreover, this is vastly suboptimal, because
>> of the risk of local minima.
>>
>> Something similar was reported many years ago, see
>> <https://stat.ethz.ch/pipermail/r-help/2003-November/041784.html>.
>> It was
>> then suggested that R's behaviour was correct. I'm not familiar with such
>> an early R version, but the up-to-date documentation of kmeans clearly
>> states that "Except for the Lloyd-Forgy method, k clusters will always be
>> returned if a number is specified.".
>> I am using the default Hartigan-Wong, and I specify an exact number k:
>> thus, k clusters should be returned. They aren't, and the empty cluster is
>> then more likely the symptom of a bug rather than the outcome of a 'true'
>> local minimum.
>>
>> Using synaptic, I managed to downgrade R to version 2.13.2. The problem
>> disappeared, i.e. the previous message/segfault didn't occur anymore.
>>
>> Summarizing: given the same dataset, either an unreasonable message or a
>> segfault regularly occurs in version 2.15.2 when invoking kmeans() on an
>> Ubuntu 11.10 64-bit machine. This does not happen at all in previous
>> versions of R, on the same machine and operating system.
>>
>> I respectfully suggest that the behaviour shown in the aforementioned
>> versions 2.13.2 and 2.14.1 should be considered 'normal', and that version
>> 2.15.2 should revert to that.
>>
>> Kind regards,
>> Luca Nanetti.
-------------- next part --------------
"Day1" "Day2" "Day3" "Day4" "Day5" "Day6"
"1" 4 5 5 3 5 5
"2" 7 7 6 5 6 6
"3" 6 6 5 5 5 5
"4" 5 3 4 3 2 4
"5" 4 3 2 5 3 2
"6" 6 6 6 5 6 6
"7" 6 7 6 6 7 6
"8" 4 3 5 4 5 5
"9" 3 5 5 5 5 6
"10" 4 5 3 2 4 4
"11" 7 7 7 5 7 7
"12" 3 4 2 2 2 2
"13" 4 6 6 4 6 6
"14" 5 6 5 6 6 6
"15" 4 5 5 5 4 3
"16" 5 6 6 6 6 6
"17" 7 7 7 6 7 6
"18" 3 2 3 3 4 2
"19" 6 5 5 4 5 4
"20" 5 4 1 5 1 3
"21" 4 5 5 4 6 5
"22" 3 4 6 5 6 3
"23" 2 3 2 3 3 3
"24" 5 6 5 3 4 5
"25" 6 6 6 6 6 6
"26" 5 4 5 5 5 5
"27" 5 6 6 1 3 6
"28" 4 4 4 3 3 5
"29" 6 7 5 5 4 6
"30" 3 2 2 2 3 2
"31" 2 4 1 6 4 3
"32" 4 6 4 5 4 5
"33" 3 2 2 3 3 3
"34" 2 3 6 5 4 4
"35" 2 2 1 1 1 2
"36" 2 3 2 3 2 3
"37" 3 6 5 5 3 5
"38" 7 3 3 7 3 5
"39" 2 2 4 4 2 4
"40" 2 4 3 2 3 2