[R] How a clustering algorithm in R can end up with negative silhouette values?
Martin Maechler
maechler at stat.math.ethz.ch
Mon Feb 22 17:08:58 CET 2016
>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>> on Mon, 22 Feb 2016 16:48:39 +0100 writes:
>>>>> Sarah Goslee <sarah.goslee at gmail.com>
>>>>> on Fri, 19 Feb 2016 15:22:22 -0500 writes:
>> Ah, my guess about the confusion was wrong, then. You're
>> misunderstanding silhouette() instead.
>> From ?silhouette:
>> Observations with a large s(i) (almost 1) are very
>> well clustered, a small s(i) (around 0) means that the
>> observation lies between two clusters, and observations
>> with a negative s(i) are probably placed in the wrong
>> cluster.
>> In more detail, they're looking at different things.
>> clara() assigns each point to a cluster based on the
>> distance to the nearest medoid.
>> silhouette() does something different: instead of
>> comparing the distances to the closest medoid and the next
>> closest medoid, which is what you seem to be assuming,
>> silhouette() looks at the mean distance to ALL other
>> points assigned to that cluster, vs the mean distance to
>> all points in other clusters. The distance to the medoid
>> is irrelevant, except as it is one of the points in that
>> cluster.
>> So a negative silhouette value is entirely possible, and
>> means that the cluster produced doesn't represent the
>> dataset very well.
> Indeed ... and this extends to pam(), even; as you say above,
> " silhouette() does something different " :
> If you look at the plots of
> example(silhouette)
> where the silhouettes of pam(ruspini, k = k') , k' = 2,..,6
> are displayed, or if you directly look at
> plot( silhouette(ruspini, k = 6) )
oops... that should have been
plot( silhouette(pam(ruspini, k = 6)) )
> you will notice that pam() itself can easily lead to negative
> silhouette values.
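For those who prefer numbers to plots, the same check can be done by counting. A small sketch, assuming only the cluster package and its `ruspini` data; `neg_by_k` is just an illustrative variable name:

```r
library(cluster)

data(ruspini)

# For each k, count the observations whose silhouette width is negative.
neg_by_k <- sapply(2:6, function(k) {
  si <- silhouette(pam(ruspini, k = k))
  sum(si[, "sil_width"] < 0)
})
names(neg_by_k) <- paste0("k=", 2:6)
neg_by_k

# Rows with s(i) < 0 for k = 6; the "neighbor" column names the cluster
# that each of these observations is closer to on average.
si6 <- silhouette(pam(ruspini, k = 6))
si6[si6[, "sil_width"] < 0, , drop = FALSE]
```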
> Martin Maechler [ == maintainer("cluster") ]
>> On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
>> <Behnam.ABABAEI at limagrain.com> wrote:
>>> Sarah, sorry for taking up your time.
>>>
>>> I totally agree with you about how it works. But please
>>> let's take a look at this part of the description:
>>>
>>> "Once k representative objects have been selected from
>>> the sub-dataset, each observation of the entire dataset
>>> is assigned to the nearest medoid. The mean (equivalent
>>> to the sum) of the dissimilarities of the observations to
>>> their closest medoid is used as a measure of the quality
>>> of the clustering. The sub-dataset for which the mean (or
>>> sum) is minimal, is retained. A further analysis is
>>> carried out on the final partition."
>>>
>>> It says each observation is finally assigned to the
>>> closest medoid. The whole clustering process may be
>>> imperfect in terms of isolation of clusters, but each
>>> observation is already assigned to the closest one and
>>> according to the silhouette formula, the silhouette value
>>> cannot be negative, as a must always be less than b.
>>>
>>> Regards, Behnam.
>>>
>>> ________________________________________
>>> From: Sarah Goslee <sarah.goslee at gmail.com>
>>> Sent: 19 February 2016 20:58
>>> To: ABABAEI, Behnam
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] How a clustering algorithm in R can end up with negative silhouette values?
>>>
>>> You need to think more carefully about the details of the
>>> clara() method.
>>>
>>> The algorithm draws repeated samples of sampsize from the
>>> larger dataset, as specified by the arguments to the
>>> function. It clusters each sample in turn, and saves the
>>> best one. It uses the medoids from the best one to
>>> assign all of the points to a cluster.
>>>
>>> But because the clustering is based on a subsample, it
>>> may not be representative of the dataset as a whole, and
>>> may not provide a good clustering overall. Just because
>>> it clusters the subsample well, doesn't mean it clusters
>>> the entirety. The details section of the help describes
>>> this, and the book reference goes into more detail.
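A rough sketch of the procedure described above, assuming the `xclara` data from the cluster package; the values k = 3, samples = 5 and sampsize = 40 are arbitrary choices for illustration. The silhouette widths are computed over the whole data set with the default `silhouette()` method and a full distance matrix, so they describe the final partition rather than only the best sub-sample.

```r
library(cluster)

data(xclara)

# Medoids are chosen from the best of 'samples' sub-samples of size
# 'sampsize'; every row of xclara is then assigned to its nearest medoid.
fit <- clara(xclara, k = 3, samples = 5, sampsize = 40)
table(fit$clustering)

# Silhouette widths for all observations, from the full distance matrix.
si <- silhouette(fit$clustering, dist(xclara))
sum(si[, "sil_width"] < 0)   # observations closer, on average, to another cluster
summary(si)$avg.width        # overall average silhouette width
```

Increasing `samples` or `sampsize` and repeating the count is one way to see how strongly the result depends on the sub-sampling.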
>>>
>>> Sarah
>>>
>>>
>>>
>>> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
>>> <Behnam.ABABAEI at limagrain.com> wrote:
>>>> Hi Sarah,
>>>>
>>>> Thank you for the response. But it is said in its
>>>> description that after each run (sample), each
>>>> observation in the whole dataset is assigned to the
>>>> closest cluster. So how is it possible for one
>>>> observation to be wrongly allocated, even with clara?
>>>>
>>>> Behnam
>>>>
>>>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
>>>> <sarah.goslee at gmail.com> wrote:
>>>>
>>>> That means that points have been assigned to the wrong
>>>> groups. This may readily happen with a clustering method
>>>> like cluster::clara() that uses a subset of the data to
>>>> cluster a dataset too large to analyze as a
>>>> unit. Negative silhouette numbers strongly suggest that
>>>> your clustering parameters should be changed.
>>>>
>>>> Sarah
>>>>
>>>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
>>>> <Behnam.ABABAEI at limagrain.com> wrote:
>>>>> Hi,
>>>>>
>>>>>
>>>>> We know that clustering methods in R assign
>>>>> observations to the closest medoids. Hence, each
>>>>> observation is supposed to end up in its closest
>>>>> cluster. So, I wonder how it is possible to have negative
>>>>> silhouette values, when we supposedly assign each
>>>>> observation to the closest cluster and the formula
>>>>> in the silhouette method cannot become negative?
>>>>>
>>>>>
>>>>> Behnam.
>>>>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
>> more, see https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html and provide
>> commented, minimal, self-contained, reproducible code.