[R] How a clustering algorithm in R can end up with negative silhouette values?

Tue Feb 23 08:49:23 CET 2016

>>>>> ABABAEI, Behnam <Behnam.ABABAEI at limagrain.com>
>>>>>     on Mon, 22 Feb 2016 16:15:32 +0000 writes:

    > Thank you Martin.
    > I finally came up with this idea: use clara with 100 samples of size 1000, then use the resulting clusters as starting points for kmeans. I think this combines the advantages of clara and kmeans: it gives more distinct clusters than kmeans alone, takes less time than kmeans alone, and avoids the large clara samples that take too long to process.

    > Yet I cannot prove which one is better: clara alone, or the combined method. Do you have any suggestions?

[Replying to the mailing list; this should have been kept there]

Just a gut feeling:

Clara alone will be simpler and faster, and I cannot see that
extra kmeans iterations would in general be beneficial.

As I said earlier in this thread, I *would* strongly consider using
rngR = TRUE  and more than one (random) run so you get a feeling
for the influence of the random sample at start.
In addition, from the several runs, you could keep the best one.

Martin
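
Martin's suggestion of several random runs could be sketched roughly as
follows. This is only an illustration under stated assumptions: the
simulated data, the number of runs, and the values of k, samples and
sampsize are all made up here, and "best" is taken to mean the smallest
value of the objective component returned by clara().

```r
library(cluster)

# Hypothetical data: two Gaussian groups in two dimensions (an assumption)
set.seed(42)
x <- rbind(matrix(rnorm(2000, mean = 0), ncol = 2),
           matrix(rnorm(2000, mean = 4), ncol = 2))

# Several clara() runs, each with its own seed; rngR = TRUE makes clara()
# draw its sub-samples with R's random number generator, so set.seed()
# actually changes which sub-samples are used.
runs <- lapply(1:10, function(seed) {
  set.seed(seed)
  clara(x, k = 2, samples = 5, sampsize = 50, rngR = TRUE)
})

# The objective is the mean dissimilarity of all observations to their
# closest medoid; smaller is better.
objectives <- vapply(runs, function(fit) fit$objective, numeric(1))
best <- runs[[which.min(objectives)]]

range(objectives)   # spread across runs: the influence of the random samples
best$medoids        # medoids of the run that is kept
```

A wide range of objective values across runs would suggest that the
result depends strongly on which sub-samples happened to be drawn.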

    > Best,

    > Behnam

    > On Mon, Feb 22, 2016 at 8:09 AM -0800, "Martin Maechler" <maechler at stat.math.ethz.ch> wrote:

>>>>> Martin Maechler <maechler at stat.math.ethz.ch>
>>>>>     on Mon, 22 Feb 2016 16:48:39 +0100 writes:

>>>>> Sarah Goslee <sarah.goslee at gmail.com>
>>>>>     on Fri, 19 Feb 2016 15:22:22 -0500 writes:

    >>> Ah, my guess about the confusion was wrong, then. You're
    >>> misunderstanding silhouette() instead.

    >>> From ?silhouette:

    >>> Observations with a large s(i) (almost 1) are very
    >>> well clustered, a small s(i) (around 0) means that the
    >>> observation lies between two clusters, and observations
    >>> with a negative s(i) are probably placed in the wrong
    >>> cluster.

    >>> In more detail, they're looking at different things.
    >>> clara() assigns each point to a cluster based on the
    >>> distance to the nearest medoid.

    >>> silhouette() does something different: instead of
    >>> comparing the distances to the closest medoid and the next
    >>> closest medoid, which is what you seem to be assuming,
    >>> silhouette() looks at the mean distance to ALL other
    >>> points assigned to that cluster, vs the mean distance to
    >>> all points in the nearest other cluster. The distance to
    >>> the medoid is irrelevant, except as it is one of the
    >>> points in that cluster.

    >>> So a negative silhouette value is entirely possible, and
    >>> means that the cluster produced doesn't represent the
    >>> dataset very well.
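
The comparison that silhouette() makes can be reproduced by hand, which
may make the difference from the medoid-based assignment easier to see.
The sketch below assumes the usual definition s(i) = (b - a) / max(a, b),
with a the mean distance to the other members of the observation's own
cluster and b the smallest mean distance to the members of any other
cluster. The helper sil_by_hand() is hypothetical, and the choice of
pam() with k = 4 on the ruspini data is only for illustration.

```r
library(cluster)
data(ruspini)

fit <- pam(ruspini, k = 4)
cl  <- fit$clustering
D   <- as.matrix(dist(ruspini))   # Euclidean distances between all points

# Hypothetical helper: silhouette width of observation i, computed directly
sil_by_hand <- function(i) {
  own <- cl == cl[i]
  # a: mean distance to the OTHER members of the own cluster (D[i, i] is 0)
  a <- sum(D[i, own]) / (sum(own) - 1)
  # b: smallest mean distance to the members of any other cluster
  b <- min(tapply(D[i, !own], cl[!own], mean))
  (b - a) / max(a, b)
}

s_hand <- vapply(seq_along(cl), sil_by_hand, numeric(1))

# Same quantity from the package, in the original order of the observations
s_pkg <- as.numeric(silhouette(cl, dist(ruspini))[, "sil_width"])

all.equal(s_hand, s_pkg)
```

Note that no medoid appears anywhere in sil_by_hand(): only distances
between observations and cluster memberships are used.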

    >> Indeed ... and this extends to pam(), even; as you say above,
    >> " silhouette() does something different " :

    >> If you look at the plots of

    >> example(silhouette)

    >> where the silhouettes of   pam(ruspini, k = k')  ,  k' = 2,..,6
    >> are displayed, or if you directly look at

    >> plot( silhouette(ruspini, k = 6) )

    > oops... that should have been

    > plot( silhouette(pam(ruspini, k = 6)) )

    >> you will notice that pam() itself can easily lead to negative
    >> silhouette values.

    >> Martin Maechler  [  == maintainer("cluster")  ]

    >>> On Fri, Feb 19, 2016 at 3:04 PM, ABABAEI, Behnam
    >>> <Behnam.ABABAEI at limagrain.com> wrote:
    >>>> Sarah, sorry for taking up your time.
    >>>> 
    >>>> I totally agree with you about how it works. But please
    >>>> let's take a look at this part of the description:
    >>>> 
    >>>> "Once k representative objects have been selected from
    >>>> the sub-dataset, each observation of the entire dataset
    >>>> is assigned to the nearest medoid. The mean (equivalent
    >>>> to the sum) of the dissimilarities of the observations to
    >>>> their closest medoid is used as a measure of the quality
    >>>> of the clustering. The sub-dataset for which the mean (or
    >>>> sum) is minimal, is retained. A further analysis is
    >>>> carried out on the final partition."
    >>>> 
    >>>> It says each observation is finally assigned to the
    >>>> closest medoid. The whole clustering process may be
    >>>> imperfect in terms of isolation of clusters, but each
    >>>> observation is already assigned to the closest one and
    >>>> according to the silhouette formula, the silhouette value
    >>>> cannot be negative, as a must always be less than b.
    >>>> 
    >>>> Regards, Behnam.
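
A small counterexample may help with the claim that a must always be
less than b. The one-dimensional data and the two medoids below are made
up for illustration; the medoids 0 and 12.5 are in fact the medoids of
the two resulting clusters. Every observation is assigned to its nearest
medoid, yet the point 5.9 gets a negative silhouette width, because its
own cluster is wide (a = 7.4) while the neighbouring cluster is compact
(b = 6.6), giving s = (6.6 - 7.4) / 7.4, about -0.108.

```r
library(cluster)

# Hypothetical 1-D data: a wide cluster around 0, a tight one around 12.5
x       <- c(-6, -3, 0, 3, 5.9, 12, 12.5, 13)
medoids <- c(0, 12.5)

# Assign every observation to its nearest medoid
d_to_medoids <- abs(outer(x, medoids, "-"))
cl <- apply(d_to_medoids, 1, which.min)

# Silhouette widths for this nearest-medoid partition
sil <- silhouette(cl, dist(x))
data.frame(x = x, cluster = cl,
           sil_width = round(sil[, "sil_width"], 3))
```

The point 5.9 is 5.9 away from medoid 0 and 6.6 away from medoid 12.5,
so the assignment is correct in the nearest-medoid sense even though the
silhouette width is below zero.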
    >>>> 
    >>>> ________________________________________
    >>>> From: Sarah Goslee <sarah.goslee at gmail.com>
    >>>> Sent: 19 February 2016 20:58
    >>>> To: ABABAEI, Behnam
    >>>> Cc: r-help at r-project.org
    >>>> Subject: Re: [R] How a clustering algorithm in R can end
    >>>> up with negative silhouette values?
    >>>> 
    >>>> You need to think more carefully about the details of the
    >>>> clara() method.
    >>>> 
    >>>> The algorithm draws repeated samples of sampsize from the
    >>>> larger dataset, as specified by the arguments to the
    >>>> function.  It clusters each sample in turn, and saves the
    >>>> best one.  It uses the medoids from the best one to
    >>>> assign all of the points to a cluster.
    >>>> 
    >>>> But because the clustering is based on a subsample, it
    >>>> may not be representative of the dataset as a whole, and
    >>>> may not provide a good clustering overall. Just because
    >>>> it clusters the subsample well, doesn't mean it clusters
    >>>> the entirety. The details section of the help describes
    >>>> this, and the book reference goes into more detail.
    >>>> 
    >>>> Sarah
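
The procedure described above can be inspected on a small simulated
example. Everything specific here is an assumption made for
illustration: the data, k = 2, samples = 5 and sampsize = 40. The sketch
checks that each observation of the full dataset sits with its nearest
medoid, and then computes silhouette widths for the full dataset from
the clustering vector and a distance matrix, so the two notions can be
compared side by side.

```r
library(cluster)

# Hypothetical data: two overlapping Gaussian groups (an assumption)
set.seed(1)
x <- rbind(matrix(rnorm(400, mean = 0), ncol = 2),
           matrix(rnorm(400, mean = 3), ncol = 2))

# Medoids are chosen from sub-samples of 40 points, then used for all 400
fit <- clara(x, k = 2, samples = 5, sampsize = 40)

# Euclidean distance from every observation to every medoid
d_med <- sapply(seq_len(nrow(fit$medoids)), function(j)
  sqrt(rowSums(sweep(x, 2, fit$medoids[j, ])^2)))

# Distance to the medoid of the assigned cluster vs. the smallest distance
d_own <- d_med[cbind(seq_len(nrow(x)), fit$clustering)]
all(d_own <= apply(d_med, 1, min) + 1e-8)

# Silhouette widths over the full dataset, not just the best sub-sample
sil <- silhouette(fit$clustering, dist(x))
sum(sil[, "sil_width"] < 0)   # number of observations with negative width
```

Any observations counted in the last line are nearest to their own
medoid and still have a negative silhouette width.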
    >>>> 
    >>>> 
    >>>> 
    >>>> On Fri, Feb 19, 2016 at 2:55 PM, ABABAEI, Behnam
    >>>> <Behnam.ABABAEI at limagrain.com> wrote:
    >>>>> Hi Sarah,
    >>>>> 
    >>>>> Thank you for the response. But it is said in its
    >>>>> description that after each run (sample), each
    >>>>> observation in the whole dataset is assigned to the
    >>>>> closest cluster. So how is it possible for one
    >>>>> observation to be wrongly allocated, even with clara?
    >>>>> 
    >>>>> Behnam
    >>>>> 
    >>>>> 
    >>>>> 
    >>>>> 
    >>>>> On Fri, Feb 19, 2016 at 11:48 AM -0800, "Sarah Goslee"
    >>>>> <sarah.goslee at gmail.com> wrote:
    >>>>> 
    >>>>> That means that points have been assigned to the wrong
    >>>>> groups. This may readily happen with a clustering method
    >>>>> like cluster::clara() that uses a subset of the data to
    >>>>> cluster a dataset too large to analyze as a
    >>>>> unit. Negative silhouette numbers strongly suggest that
    >>>>> your clustering parameters should be changed.
    >>>>> 
    >>>>> Sarah
    >>>>> 
    >>>>> On Fri, Feb 19, 2016 at 6:33 AM, ABABAEI, Behnam
    >>>>> <Behnam.ABABAEI at limagrain.com> wrote:
    >>>>>> Hi,
    >>>>>> 
    >>>>>> 
    >>>>>> We know that clustering methods in R assign
    >>>>>> observations to the closest medoids. Hence, each
    >>>>>> observation should already be in its closest cluster.
    >>>>>> So, I wonder how it is possible to get negative
    >>>>>> silhouette values, when we are supposedly assigning
    >>>>>> each observation to the closest cluster, so that the
    >>>>>> silhouette formula cannot become negative?
    >>>>>> 
    >>>>>> 
    >>>>>> Behnam.