[R] cluster in R

Christian Hennig chrish at stats.ucl.ac.uk
Thu Oct 19 12:37:31 CEST 2006


On Wed, 18 Oct 2006, Weiwei Shi wrote:

> Dear Chris:
>
> I tried to use cor+1, but it still gives me an average silhouette width < 0.

Well, then it seems that the clustering is not that good.
I don't know your data, and there is no theoretical reason why the average
silhouette width has to be positive. You should read the Kaufman and
Rousseeuw book to understand the average silhouette width better.
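
To illustrate why a negative value is possible: the silhouette width of an
observation i is s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is its
average dissimilarity to its own cluster and b(i) is the smallest average
dissimilarity to any other cluster. It is negative whenever the observation
is, on average, closer to another cluster than to its own. The following is
only a minimal sketch: the toy data and both label vectors are made up, and
it uses silhouette() from the cluster package as a stand-in for the
avg.silwidth value that cluster.stats() reports.

```r
library(cluster)  # provides silhouette(); a recommended package shipped with R

# Hypothetical toy data: two well-separated groups on a line
x <- c(0, 0.1, 0.2, 5, 5.1, 5.2)
d <- dist(x)

good <- c(1L, 1L, 1L, 2L, 2L, 2L)  # labels that match the two groups
bad  <- c(1L, 2L, 1L, 2L, 1L, 2L)  # labels that cut across the groups

# Average silhouette width = mean of the per-observation widths
avg_sil <- function(labels, d) mean(silhouette(labels, d)[, "sil_width"])

avg_good <- avg_sil(good, d)  # close to 1: clusters are compact and separated
avg_bad  <- avg_sil(bad, d)   # below 0: labels disagree with the dissimilarities
```

So a negative average only says that the partition and the dissimilarities
disagree; it does not indicate a computational error.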

Best wishes,
Christian

>
>> set.seed(1000)
>> t9 <- cor(t(x), method="pearson")+1 # here I add 1
>> t8 <- as.dist(t9)
>> t7 <- cutree(hclust(t8), 4)
>> cluster.stats(t8, t7)$avg.silwidth
> [1] -0.008750826
>> set.seed(1000)
>> t9 <- cor(t(x), method="pearson") # here I did not add 1
>> t8 <- as.dist(t9)
>> t7 <- cutree(hclust(t8), 4)
>> cluster.stats(t8, t7)$avg.silwidth
> [1] -0.09543089
>
> On 10/18/06, Weiwei Shi <helprhelp at gmail.com> wrote:
>> Dear Chris:
>> 
>> thanks for the prompt reply!
>> 
>> You are right, the dist built from Pearson correlations has negative
>> values, so I should use cor+1 in my case (since negatively correlated
>> genes should be considered farthest). Thanks.
>> 
>> As to ?cluster.stats, I double-checked and found that I need to
>> restart JGR; only then does the help system recognize a newly
>> loaded package, such as fpc in this case.
>> 
>> sorry for the confusion,
>> 
>> weiwei
>> 
>> On 10/18/06, Christian Hennig <chrish at stats.ucl.ac.uk> wrote:
>> > Dear Weiwei,
>> >
>> > > btw, ?cluster.stats does not work on my Mac machine.
>> > >> version
>> > >              _
>> > > platform       i386-apple-darwin8.6.1
>> > > arch           i386
>> > > os             darwin8.6.1
>> > > system         i386, darwin8.6.1
>> > > status
>> > > major          2
>> > > minor          3.1
>> > > year           2006
>> > > month          06
>> > > day            01
>> > > svn rev        38247
>> > > language       R
>> > > version.string Version 2.3.1 (2006-06-01)
>> >
>> > Because I don't have access to a Mac, I can't tell you anything about
>> > this, unfortunately.
>> > I always thought that my package should work on all platforms if it
>> > passes all the standard tests for packages?
>> > (Is there anyone else who could comment on this please?)
>> >
>> > > I have a sample like this
>> > >> dim(dd.df)
>> > > [1] 142  28
>> > >
>> > > and I want to cluster rows;
>> > > first of all, I followed the examples for cluster.stats() by
>> > > d.dd <- dist(dd.df) # use Euclidean
>> > > d.4 <- cutree(hclust(d.dd), 4) # 4 clusters I tried
>> > > cluster.stats(d.dd, d.4) # gives me some results like this:
>> > >
>> > > $cluster.size
>> > > [1] 133   5   2   2
>> > >
>> > > $avg.silwidth
>> > > [1] 0.9857916
>> > >
>> > > but when I tried to use pearson dist here (by visualization, I think 4
>> > > or 5 clusters are good for pearson dist), it gave me a very bad
>> > > avg.silwidth
>> > >
>> > > d.dd <- as.dist(cor(t(x),method="pearson")) # is this correct?
>> > > $cluster.size
>> > > [1] 86 31  6 19
>> > >
>> > > $avg.silwidth
>> > > [1] -0.09543089
>> >
>> > cor can give negative values, which doesn't fit the usual definition
>> > of a distance. I don't know what as.dist does in this case, but I think
>> > that, depending on your application, you should rather use the absolute
>> > value of the correlation, or 1+cor.
>> >
>> > > btw, what's $separation? where can I find the detailed explanation of
>> > > the output from cluster.stats?
>> >
>> > This is documented on the cluster.stats help page:
>> >
>> > separation: vector of clusterwise minimum distances of a point in the
>> >            cluster to a point of another cluster.
>> >
>> > Best regards,
>> > Christian
>> >
>> >
>> > *** --- ***
>> > Christian Hennig
>> > University College London, Department of Statistical Science
>> > Gower St., London WC1E 6BT, phone +44 207 679 1698
>> > chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche
>> >
>> 
>> 
>> --
>> Weiwei Shi, Ph.D
>> Research Scientist
>> GeneGO, Inc.
>> 
>> "Did you always know?"
>> "No, I did not. But I believed..."
>> ---Matrix III
>> 
>
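
On the as.dist() question in the quoted message: as.dist() takes the lower
triangle of the matrix as given, so negative correlations stay negative.
Note also that a correlation, and likewise cor+1, is a similarity (large for
similar rows), whereas hclust() and silhouette widths expect a dissimilarity
(small for similar rows). A minimal sketch, assuming that 1 - cor is an
acceptable dissimilarity for the application (this is one common choice, not
something established in this thread); the toy matrix is hypothetical.

```r
# Hypothetical toy data: rows a and b move together, row c moves against them
x <- rbind(a = c(1, 2, 3, 4),
           b = c(2, 4, 6, 8.5),
           c = c(4, 3, 2, 1))

r <- cor(t(x), method = "pearson")  # row-by-row correlations, in [-1, 1]

# as.dist() keeps the lower-triangle values unchanged, negatives included
d_raw <- as.dist(r)

# 1 + r is still a similarity: the anti-correlated pair gets the SMALLEST value
d_plus <- as.dist(1 + r)

# 1 - r is a dissimilarity: 0 for perfect correlation, 2 for perfect
# anti-correlation, so anti-correlated rows end up farthest apart
d_minus <- as.dist(1 - r)

as.matrix(d_minus)["a", "c"]  # the largest entry: a and c are anti-correlated
```

If negatively correlated genes should count as farthest, d_minus matches that
intent, while d_plus puts them closest together.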

*** --- ***
Christian Hennig
University College London, Department of Statistical Science
Gower St., London WC1E 6BT, phone +44 207 679 1698
chrish at stats.ucl.ac.uk, www.homepages.ucl.ac.uk/~ucakche


