[R] Empty cluster / segfault using vanilla kmeans with version 2.15.2
Luca Nanetti
luca.nanetti at gmail.com
Tue Mar 12 15:11:37 CET 2013
This example dataset breaks kmeans() in R version 2.15.2, installed from
the Belgian CRAN mirror, on Ubuntu 12.04 LTS 64-bit:
> my.sample.2
Day1 Day2 Day3 Day4 Day5 Day6
[1,] 4 5 5 3 5 5
[2,] 7 7 6 5 6 6
[3,] 6 6 5 5 5 5
[4,] 5 3 4 3 2 4
[5,] 4 3 2 5 3 2
[6,] 6 6 6 5 6 6
[7,] 6 7 6 6 7 6
[8,] 4 3 5 4 5 5
[9,] 3 5 5 5 5 6
[10,] 4 5 3 2 4 4
[11,] 7 7 7 5 7 7
[12,] 3 4 2 2 2 2
[13,] 4 6 6 4 6 6
[14,] 5 6 5 6 6 6
[15,] 4 5 5 5 4 3
[16,] 5 6 6 6 6 6
[17,] 7 7 7 6 7 6
[18,] 3 2 3 3 4 2
[19,] 6 5 5 4 5 4
[20,] 5 4 1 5 1 3
[21,] 4 5 5 4 6 5
[22,] 3 4 6 5 6 3
[23,] 2 3 2 3 3 3
[24,] 5 6 5 3 4 5
[25,] 6 6 6 6 6 6
[26,] 5 4 5 5 5 5
[27,] 5 6 6 1 3 6
[28,] 4 4 4 3 3 5
[29,] 6 7 5 5 4 6
[30,] 3 2 2 2 3 2
[31,] 2 4 1 6 4 3
[32,] 4 6 4 5 4 5
[33,] 3 2 2 3 3 3
[34,] 2 3 6 5 4 4
[35,] 2 2 1 1 1 2
[36,] 2 3 2 3 2 3
[37,] 3 6 5 5 3 5
[38,] 7 3 3 7 3 5
[39,] 2 2 4 4 2 4
[40,] 2 4 3 2 3 2
## Define a variable
> hm.clusters <- 5
## Run kmeans with 100 random starts, repeatedly; the first 7 attempts
## fail with the 'empty cluster' error
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
Error: empty cluster: try a better set of initial centers
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
## The next attempt provokes the segmentation fault. Please note that there is
## nothing special about the 7 times reported above; next time it can happen on
## the very first attempt
> k.liking.ts <- kmeans(my.sample.2, hm.clusters, nstart=100, iter.max=50)
*** caught segfault ***
address 0x10, cause 'memory not mapped'
Segmentation fault (core dumped)
That's about it. The attached file was written with write.table(x,
file=...)
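For anyone who wants to load the attachment: a file written by write.table()
with its default settings can be read back with read.table(header = TRUE) and
converted to a matrix. This is a minimal sketch only. The path is a
hypothetical temporary file (the original file name is not given above), and
only the first three rows of the data are used for brevity.

```r
# Hypothetical path; the original file name is elided in the post.
path <- tempfile(fileext = ".txt")

# First three rows of the dataset, for illustration only.
x <- matrix(c(4, 5, 5, 3, 5, 5,
              7, 7, 6, 5, 6, 6,
              6, 6, 5, 5, 5, 5),
            ncol = 6, byrow = TRUE,
            dimnames = list(NULL, paste0("Day", 1:6)))

# write.table() defaults: quoted row and column names, space separated.
write.table(x, file = path)

# The header line has one field fewer than the data rows, so read.table()
# treats the first column as row names.
my.sample.2 <- as.matrix(read.table(path, header = TRUE))

# Drop the row names so the matrix prints like the one shown above.
dimnames(my.sample.2) <- list(NULL, colnames(my.sample.2))
```

With the full attachment saved locally, only the read.table() call and the
lines after it are needed, with path pointing at the saved file.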
I clustered the same dataset with R 2.14.1, on the same computer and OS, using
nstart=1000, and I did it 1000 times. I never had the slightest problem.
Moreover, at the risk of repeating myself, the 'empty cluster' error is
plausibly the symptom of a bug, because it _should_ never happen with the
Hartigan-Wong algorithm (the default for kmeans).
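To count how often each outcome occurs without retyping the call, the repeated
runs can be wrapped in tryCatch(). This is a minimal sketch only:
count_kmeans_outcomes() is a hypothetical helper, not part of R, and the random
matrix is just a stand-in with the same shape and value range as my.sample.2.
Note that tryCatch() can trap the 'empty cluster' error but not a segfault,
which terminates the R process.

```r
# Hypothetical helper: run kmeans() `reps` times and tabulate the outcomes.
count_kmeans_outcomes <- function(x, k, reps, nstart = 100, iter.max = 50) {
  outcome <- character(reps)
  for (i in seq_len(reps)) {
    outcome[i] <- tryCatch({
      fit <- kmeans(x, k, nstart = nstart, iter.max = iter.max)
      # A valid fit has exactly k clusters, none of them empty.
      if (length(fit$size) == k && all(fit$size > 0)) "ok" else "empty"
    },
    # A warning aborts that run and is recorded as such.
    warning = function(w) "warning",
    # Errors are recorded together with their message.
    error = function(e) paste("error:", conditionMessage(e)))
  }
  table(outcome)
}

# Stand-in data (an assumption): 40 x 6 integers drawn from 1..7.
set.seed(1)
x <- matrix(sample(1:7, 40 * 6, replace = TRUE), ncol = 6)

res <- count_kmeans_outcomes(x, k = 5, reps = 20, nstart = 10)
print(res)
```

Passing my.sample.2 instead of the stand-in matrix, with nstart = 100, would
repeat the runs shown in the transcript above.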
Kind regards,
and thanks again for your time.
Luca Nanetti
On Sat, Feb 9, 2013 at 8:52 PM, Uwe Ligges
<ligges at statistik.tu-dortmund.de> wrote:
> We need a reproducible example.
>
> Uwe Ligges
>
>
>
> On 03.02.2013 15:03, Luca Nanetti wrote:
>
>> Dear experts,
>> I am encountering a version-dependent issue.
>>
>> My laptop runs Ubuntu 12.04 LTS 64-bit, R 2.14.1; the issue explained below
>> never occurred with this version of R.
>> My desktop runs Ubuntu 11.10 64-bit, R 2.13.2; what follows applies to this
>> setup.
>>
>> The data I'm clustering is constituted by the rows of a 320 x 6 matrix
>> containing integers ranging from 1 to 7, no missing data.
>> I applied kmeans() to this matrix, literally, 256 x 10^6 times using R
>> version 2.13.2 or 2.14.1, without ever experiencing the slightest problem.
>> My usual setup is with k=5, nstart=256, iter.max=50.
>>
>> Upgrading to R 2.15.2, I experienced either a warning message ('Empty
>> cluster. Choose a better set of initial centers') or a catastrophic
>> segfault. The only way I can get any solution at all is to leave nstart at
>> its default value, i.e. 1. However, when I simply repeat the clustering, the
>> same issue still happens. Moreover, this is vastly suboptimal because of the
>> risk of local minima.
>>
>> Something similar was reported many years ago, see
>> https://stat.ethz.ch/pipermail/r-help/2003-November/041784.html.
>> It was
>> then suggested that R's behaviour was correct. I'm not familiar with such
>> an early R version, but the up-to-date documentation of kmeans clearly
>> states that "Except for the Lloyd-Forgy method, k clusters will always be
>> returned if a number is specified.".
>> I am using the default Hartigan-Wong, and I specify an exact number k:
>> thus, k clusters should be returned. They aren't, and the empty cluster is
>> then more likely the symptom of a bug rather than the outcome of a 'true'
>> local minimum.
>>
>> Using synaptic, I managed to downgrade R to version 2.13.2. The problem
>> disappeared, i.e. the previous message/segfault didn't occur anymore.
>>
>> Summarizing: given the same dataset, either an unreasonable message or a
>> segfault regularly happens in version 2.15.2 when invoking kmeans() on an
>> Ubuntu 11.10 64-bit machine. This does not happen at all in previous
>> versions of R, on the same machine and operating system.
>>
>> I respectfully suggest that the behaviour shown in the aforementioned
>> versions 2.13.2 and 2.14.1 should be considered 'normal', and that version
>> 2.15.2 should revert to that.
>>
>> Kind regards,
>> Luca Nanetti.
-------------- next part --------------
"Day1" "Day2" "Day3" "Day4" "Day5" "Day6"
"1" 4 5 5 3 5 5
"2" 7 7 6 5 6 6
"3" 6 6 5 5 5 5
"4" 5 3 4 3 2 4
"5" 4 3 2 5 3 2
"6" 6 6 6 5 6 6
"7" 6 7 6 6 7 6
"8" 4 3 5 4 5 5
"9" 3 5 5 5 5 6
"10" 4 5 3 2 4 4
"11" 7 7 7 5 7 7
"12" 3 4 2 2 2 2
"13" 4 6 6 4 6 6
"14" 5 6 5 6 6 6
"15" 4 5 5 5 4 3
"16" 5 6 6 6 6 6
"17" 7 7 7 6 7 6
"18" 3 2 3 3 4 2
"19" 6 5 5 4 5 4
"20" 5 4 1 5 1 3
"21" 4 5 5 4 6 5
"22" 3 4 6 5 6 3
"23" 2 3 2 3 3 3
"24" 5 6 5 3 4 5
"25" 6 6 6 6 6 6
"26" 5 4 5 5 5 5
"27" 5 6 6 1 3 6
"28" 4 4 4 3 3 5
"29" 6 7 5 5 4 6
"30" 3 2 2 2 3 2
"31" 2 4 1 6 4 3
"32" 4 6 4 5 4 5
"33" 3 2 2 3 3 3
"34" 2 3 6 5 4 4
"35" 2 2 1 1 1 2
"36" 2 3 2 3 2 3
"37" 3 6 5 5 3 5
"38" 7 3 3 7 3 5
"39" 2 2 4 4 2 4
"40" 2 4 3 2 3 2