[R] How a clustering algorithm in R can end up with negative silhouette values?

Martin Maechler maechler at stat.math.ethz.ch
Mon Feb 22 17:08:58 CET 2016


>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Mon, 22 Feb 2016 16:48:39 +0100 writes:

>>>>> Sarah Goslee <sarah.goslee at gmail.com>
>>>>>     on Fri, 19 Feb 2016 15:22:22 -0500 writes:

    >> Ah, my guess about the confusion was wrong, then. You're
    >> misunderstanding silhouette() instead.

    >> From ?silhouette:

    >> Observations with a large s(i) (almost 1) are very
    >> well clustered, a small s(i) (around 0) means that the
    >> observation lies between two clusters, and observations
    >> with a negative s(i) are probably placed in the wrong
    >> cluster.


    >> In more detail, they're looking at different things.
    >> clara() assigns each point to a cluster based on the
    >> distance to the nearest medoid.

    >> silhouette() does something different: instead of
    >> comparing the distances to the closest medoid and the next
    >> closest medoid, which is what you seem to be assuming,
    >> silhouette() looks at the mean distance to ALL other
    >> points assigned to that cluster, vs the mean distance to
    >> all points in the nearest other cluster. The distance to
    >> the medoid is irrelevant, except as it is one of the
    >> points in that cluster.

    >> So a negative silhouette value is entirely possible, and
    >> means that the cluster produced doesn't represent the
    >> dataset very well.
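
A minimal sketch of that point, on made-up 1-D data with hypothetical
medoids (neither is from the thread): every point is assigned to its
nearest medoid, yet one point still gets a negative silhouette width,
because s(i) compares mean distances to cluster members, not distances
to medoids.

```r
library(cluster)

# Made-up data and hypothetical medoids, for illustration only.
x       <- c(-6, -3, 0, 4.9, 9, 10, 11)
medoids <- c(0, 10)

# Assign every point to its nearest medoid (what clara()/pam() do last).
cl <- apply(abs(outer(x, medoids, "-")), 1, which.min)

d   <- dist(x)
sil <- silhouette(cl, d)   # default method: cluster codes + dissimilarities

# The same quantity by hand for the point x[4] = 4.9.
i   <- 4
dm  <- as.matrix(d)
a_i <- mean(dm[i, cl == cl[i] & seq_along(x) != i])  # own cluster, excluding i
b_i <- mean(dm[i, cl != cl[i]])   # only one other cluster in this example
s_i <- (b_i - a_i) / max(a_i, b_i)

s_i                    # about -0.35
sil[i, "sil_width"]    # same value from silhouette()
```

Here x[4] is 4.9 from its own medoid and 5.1 from the other one, so the
assignment is "correct" in the nearest-medoid sense, but its mean
distance to its own cluster (7.9) exceeds its mean distance to the
other cluster (5.1).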

    > Indeed ... and this extends to pam(), even; as you say above,
    > " silhouette() does something different " :

    > If you look at the plots of

    > example(silhouette)

    > where the silhouettes of   pam(ruspini, k = k')  ,  k' = 2,..,6
    > are displayed, or if you directly look at

    > plot( silhouette(ruspini, k = 6) )

oops... that should have been

    plot( silhouette(pam(ruspini, k = 6)) )

    > you will notice that pam() itself can easily lead to negative
    > silhouette values.
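
To check this numerically rather than by eye, something along the
following lines should work. It is only a sketch; the variable names
are arbitrary, and no particular output is claimed here.

```r
library(cluster)
data(ruspini)

pr6  <- pam(ruspini, k = 6)
sil6 <- silhouette(pr6)          # silhouette method for pam/clara results

widths <- sil6[, "sil_width"]
n_neg  <- sum(widths < 0)        # number of observations with s(i) < 0

n_neg
sil6[widths < 0, , drop = FALSE] # cluster, neighbor and width of those points
```

The "neighbor" column shows which other cluster each such observation
is, on average, closer to.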

    > Martin Maechler  [  == maintainer("cluster")  ]

    

    >> On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
    >> <Behnam.ABABAEI at limagrain.com> wrote:
    >>> Sarah, sorry for taking up your time.
    >>> 
    >>> I totally agree with you about how it works. But please
    >>> let's take a look at this part of the description:
    >>> 
    >>> "Once k representative objects have been selected from
    >>> the sub-dataset, each observation of the entire dataset
    >>> is assigned to the nearest medoid. The mean (equivalent
    >>> to the sum) of the dissimilarities of the observations to
    >>> their closest medoid is used as a measure of the quality
    >>> of the clustering. The sub-dataset for which the mean (or
    >>> sum) is minimal, is retained. A further analysis is
    >>> carried out on the final partition."
    >>> 
    >>> It says each observation is finally assigned to the
    >>> closest medoid. The whole clustering process may be
    >>> imperfect in terms of isolation of clusters, but each
    >>> observation is already assigned to the closest one and
    >>> according to the silhouette formula, the silhouette value
    >>> cannot be negative, as a must always be less than b.
    >>> 
    >>> Regards, Behnam.
    >>> 
    >>> ________________________________________ From: Sarah
    >>> Goslee <sarah.goslee at gmail.com> Sent: 19 February 2016
    >>> 20:58 To: ABABAEI, Behnam Cc: r-help at r-project.org
    >>> Subject: Re: [R] How a clustering algorithm in R can end
    >>> up with negative silhouette values?
    >>> 
    >>> You need to think more carefully about the details of the
    >>> clara() method.
    >>> 
    >>> The algorithm draws repeated samples of sampsize from the
    >>> larger dataset, as specified by the arguments to the
    >>> function.  It clusters each sample in turn, and saves the
    >>> best one.  It uses the medoids from the best one to
    >>> assign all of the points to a cluster.
    >>> 
    >>> But because the clustering is based on a subsample, it
    >>> may not be representative of the dataset as a whole, and
    >>> may not provide a good clustering overall. Just because
    >>> it clusters the subsample well, doesn't mean it clusters
    >>> the entirety. The details section of the help describes
    >>> this, and the book reference goes into more detail.
    >>> 
    >>> Sarah
    >>> 
    >>> 
    >>> 
    >>> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
    >>> <Behnam.ABABAEI at limagrain.com> wrote:
    >>>> Hi Sarah,
    >>>> 
    >>>> Thank you for the response. But it is said in its
    >>>> description that after each run (sample), each
    >>>> observation in the whole dataset is assigned to the
    >>>> closest cluster. So how is it possible for one
    >>>> observation to be wrongly allocated, even with clara?
    >>>> 
    >>>> Behnam
    >>>> 
    >>>> 
    >>>> 
    >>>> 
    >>>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
    >>>> <sarah.goslee at gmail.com> wrote:
    >>>> 
    >>>> That means that points have been assigned to the wrong
    >>>> groups. This may readily happen with a clustering method
    >>>> like cluster::clara() that uses a subset of the data to
    >>>> cluster a dataset too large to analyze as a
    >>>> unit. Negative silhouette numbers strongly suggest that
    >>>> your clustering parameters should be changed.
    >>>> 
    >>>> Sarah
    >>>> 
    >>>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
    >>>> <Behnam.ABABAEI at limagrain.com> wrote:
    >>>>> Hi,
    >>>>> 
    >>>>> 
    >>>>> We know that clustering methods in R assign
    >>>>> observations to the closest medoids. Hence, it is
    >>>>> supposed to be the closest cluster each observation can
    >>>>> have. So, I wonder how it is possible to have negative
    >>>>> silhouette values, when we are supposedly assigning
    >>>>> each observation to the closest cluster and the formula
    >>>>> in the silhouette method cannot become negative?
    >>>>> 
    >>>>> 
    >>>>> Behnam.
    >>>>> 

    >> ______________________________________________
    >> R-help at r-project.org mailing list -- To UNSUBSCRIBE and
    >> more, see https://stat.ethz.ch/mailman/listinfo/r-help
    >> PLEASE do read the posting guide
    >> http://www.R-project.org/posting-guide.html and provide
    >> commented, minimal, self-contained, reproducible code.



