[R-sig-eco] [R] Component analysis / cluster analysis of multiple sites based on soil characteristics

Mon Jan 23 13:51:51 CET 2012

Hello dear list!
Maybe I am demanding too much, but I am having problems finding the 
right way to tackle a seemingly trivial problem:

We counted fish at different sites. To assess habitat quality, we
sampled temperature, pH etc. at each site, resulting in 243
observations of 8 independent variables. As we would like to identify
clusters within this data set, we stumbled upon three approaches. Two
use the cluster package: dist to create a distance matrix from our
numeric variables, followed either by pam to produce a model, or by
agnes and then various tree methods to simplify the tree. The third
uses the ecodist package (distance and pco). While the results obtained
through the cluster package were the same (phew!), the ecodist approach
did not identify clusters at all.

As we are all confused, and I am both the one in charge of deciding
which way to go and the one most confused of all, I am completely lost.
Doing such an analysis for the first time, I would be satisfied with
the pam approach, which identifies 2 clusters (by iterating over each k
in 2:10 and picking the model with the maximum average silhouette
width). However, as there are so many different approaches out there, I
am not sure whether all the assumptions are met! It seems, for example,
that pH is more or less randomly distributed. Should we keep such
variables? How can I access the actual loadings of the principal axes
of the pam model? I couldn't find that anywhere! In the end, there are
only 33 observations in the 2nd group, which will make the further
analysis of fish counts heavily unbalanced. Any suggestions?
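
To make the selection step concrete, here is a minimal sketch of choosing k by maximum average silhouette width with pam. It is not the best.k helper used further down, whose definition is not shown; the synthetic matrix x is a made-up stand-in for water.par, and scaling the columns before dist is an assumption about how the variables should be weighted.

```r
library(cluster)  # provides pam()

# Hypothetical stand-in for water.par: two separated groups, 2 variables
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Standardize the columns first so that variables measured on large
# scales do not dominate the Euclidean distances (an assumption here)
d <- dist(scale(x))

# Fit pam for each candidate k and record the average silhouette width
ks <- 2:10
avg.sil <- sapply(ks, function(k) pam(d, k)$silinfo$avg.width)

# The candidate k with the highest average silhouette width
best <- ks[which.max(avg.sil)]
print(data.frame(k = ks, avg.sil = avg.sil))
print(best)
```

Note that the stand argument of pam only applies to raw data; when pam is given a dissimilarity object, any standardization has to happen before dist is called, as in the sketch.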

Code snippet:

water.par

     temp  pH   DO   BOD  COD   no3   no2  po4
1   33.5 7.4 5.30  4.04 15.0 0.120 0.008 0.20
2   33.5 7.4 5.30  4.04 15.0 0.120 0.008 0.20
.
.
.
243 29.1 7.4 6.80 12.56 45.0 0.740 0.002 0.32

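library(cluster)
# Euclidean distance matrix on the raw (unstandardized) variables
water.pca <- dist(as.matrix(water.par))
# finding the k in 2..10 with the highest average silhouette width
k <- best.k(water.pca, c(2, 10), stand = TRUE, trace = 1)
# pam ignores stand when it is given a dissimilarity object
clus.model <- pam(water.pca, k, stand = TRUE)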

clus.model$clusinfo

   size max_diss  av_diss diameter separation
1  210 30.12712 8.445552 42.29439   27.88689
2   33 12.74630 7.725452 21.91972   27.88689

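library(ecodist)
# Euclidean distances, then principal coordinates ordination
water.md <- distance(water.par, "euclidean")
water.pco <- pco(water.md)
# scatter plot of the sites on the first two ordination axes
plot(water.pco$vectors[, 1], water.pco$vectors[, 2])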
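
An ordination only places the sites in a low-dimensional space; it does not assign them to groups, so a plot like the one above will not show clusters unless membership is overlaid. The following is a rough sketch of that overlay. It uses cmdscale from base R as a stand-in for ecodist's pco, and the matrix x is again a hypothetical replacement for water.par.

```r
library(cluster)  # provides pam()

# Hypothetical stand-in for water.par
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
d <- dist(scale(x))

# Cluster membership from pam on the distance matrix
fit <- pam(d, k = 2)

# Classical scaling (principal coordinates) on the same distances;
# the result is an n x 2 matrix of site coordinates
ord <- cmdscale(d, k = 2)

# Colour each site by its pam cluster
plot(ord[, 1], ord[, 2], col = fit$clustering, pch = 19,
     xlab = "PCO 1", ylab = "PCO 2")
```

Because both steps start from the same distance matrix, the colours show how the pam groups sit within the ordination rather than offering a second, competing answer.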

Thanks in advance and sorry for the verbosity level at max!!!