[R] Re: clustering polypeptide sequences

Mon Sep 8 14:13:09 CEST 2003

Hi Peter,

   You didn't give a very specific example, but it seems to me that what
you wish to do is not really complicated. I suppose you have created a
table of sequences vs., say, hydrophobicity, charge, etc., something like:

seq	hydroph	arom
b0001 	0.104762 	0.000000
b0002 	0.035122 	0.065854
b0003 	0.024193 	0.070968
b0004 	-0.096729 	0.084112
b0005 	-0.973469 	0.091837
b0006 	-0.402713 	0.108527
b0007 	0.680672 	0.123950
b0008 	-0.209779 	0.072555
b0009 	-0.013334 	0.046154
b0010 	0.952128 	0.143617

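Suppose you have these data in a data frame called myseqs, with the
sequence names as row names so that only the numeric columns enter the
distance calculation. See help(read.table) for how to load the data; for
a tab-separated file you could try something like

    myseqs <- read.table("myseqs.txt", header = TRUE, row.names = 1)

where myseqs.txt is just a placeholder for your own file name.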
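
To make the steps that follow reproducible without a file on disk, here is a minimal sketch that builds the ten-row example table directly with read.table(text = ...). Putting the sequence names into the row names is my own choice, made so that dist() only sees numeric columns.

```r
# Build the example table inline; sequence IDs become row names,
# leaving only the numeric columns (hydroph, arom) in the data frame.
myseqs <- read.table(text = "
seq   hydroph   arom
b0001  0.104762 0.000000
b0002  0.035122 0.065854
b0003  0.024193 0.070968
b0004 -0.096729 0.084112
b0005 -0.973469 0.091837
b0006 -0.402713 0.108527
b0007  0.680672 0.123950
b0008 -0.209779 0.072555
b0009 -0.013334 0.046154
b0010  0.952128 0.143617
", header = TRUE, row.names = 1)

str(myseqs)  # check that both columns were read as numeric
```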

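# you need to load the necessary libraries
# (from R 1.9.0 on, dist() and hclust() live in the 'stats' package, which
# is loaded by default, so library(mva) is only needed on older versions)

library(mva)      # basic clustering (older R versions only)
library(cluster)  # more clustering algorithms, e.g. pam()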

# then you need to calculate the 'distances' between sequences

# this creates the Euclidean distance matrix; try help(dist) for more info
myseqs.d <- dist(myseqs)

# then we perform a hierarchical clustering

myseqs.clus <- hclust(myseqs.d)

# now check out your results

plot(myseqs.clus) # hey! you see how easy it is?
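
If you want the group memberships as numbers rather than just the picture, cutree() cuts the tree into a chosen number of groups. A minimal sketch, using the example values from the table above; k = 2 is picked purely for illustration and is not a recommendation for your data.

```r
# Example values from the table above, with sequence IDs as row names.
myseqs <- data.frame(
  hydroph = c(0.104762, 0.035122, 0.024193, -0.096729, -0.973469,
              -0.402713, 0.680672, -0.209779, -0.013334, 0.952128),
  arom    = c(0.000000, 0.065854, 0.070968, 0.084112, 0.091837,
              0.108527, 0.123950, 0.072555, 0.046154, 0.143617),
  row.names = sprintf("b%04d", 1:10)
)

myseqs.clus <- hclust(dist(myseqs))

# Cut the dendrogram into k groups (k = 2 is an arbitrary illustration).
groups <- cutree(myseqs.clus, k = 2)

table(groups)                 # how many sequences fall in each group
split(names(groups), groups)  # which sequences belong to which group
```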

# the documentation for hclust contains much more info
# other, fancier clustering algorithms live in the cluster package, e.g. pam()

myseqs.pam <- pam(myseqs, k = 2)
plot(myseqs.pam)
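
Besides plotting, the object returned by pam() can be inspected directly. A small sketch, again with the example values from the table above and k = 2 chosen only for illustration:

```r
library(cluster)

# Example values from the table above, with sequence IDs as row names.
myseqs <- data.frame(
  hydroph = c(0.104762, 0.035122, 0.024193, -0.096729, -0.973469,
              -0.402713, 0.680672, -0.209779, -0.013334, 0.952128),
  arom    = c(0.000000, 0.065854, 0.070968, 0.084112, 0.091837,
              0.108527, 0.123950, 0.072555, 0.046154, 0.143617),
  row.names = sprintf("b%04d", 1:10)
)

myseqs.pam <- pam(myseqs, k = 2)

myseqs.pam$clustering         # cluster number assigned to each sequence
myseqs.pam$medoids            # the representative sequence of each cluster
myseqs.pam$silinfo$avg.width  # average silhouette width of the partition
```

Comparing the average silhouette width across a few values of k is one common, though not the only, way to decide how many clusters to keep.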

I hope this is of some help.