[R-sig-eco] [R] Component analysis / cluster analysis of multiple sites based on soil characteristics
Sacha Viquerat
dawa.ya.moto at googlemail.com
Mon Jan 23 13:51:51 CET 2012
Hello dear list!
Maybe I am demanding too much, but I am having problems finding the
right way to tackle a seemingly trivial problem:
We counted fish at different sites. In order to assess habitat quality
at each site, we sampled temperature, pH etc. at each site, resulting in
243 observations of 8 independent variables. As we would like to
identify clusters within this data set, we stumbled upon three
approaches. Two use the cluster package: dist to create a distance
matrix from our numeric variables, followed either by pam to produce a
model, or by agnes and then various tree methods to simplify the tree.
The third uses the ecodist package (distance and pco).
While the results obtained through the cluster package were the same
(phew!), the result from the ecodist approach did not identify clusters
at all. As we are all confused, I am the one in charge of deciding
which way to go, and I am the one most confused of all, I am
completely lost. Doing such an analysis for the first time, I would be
satisfied with the pam approach identifying 2 clusters (via iterating
over each k in 2:10 and picking the max average silhouette of each
model). However, as there are so many different approaches out there, I
am not sure if all the assumptions are met! It seems for example that pH
is more or less randomly distributed. Should we keep such variables? How
can I access the actual loadings of the principal axes of the pam model?
Couldn't find that anywhere! In the end, there are only 33 observations
in the 2nd group, which will make the further analysis of fish
counts heavily unbalanced. Any suggestions?
Code snippet:
water.par
    temp  pH   DO   BOD  COD   no3   no2  po4
1   33.5 7.4 5.30  4.04 15.0 0.120 0.008 0.20
2   33.5 7.4 5.30  4.04 15.0 0.120 0.008 0.20
.
.
.
243 29.1 7.4 6.80 12.56 45.0 0.740 0.002 0.32
water.pca<-dist(as.matrix(water.par))
# find the k with the highest average silhouette width
k <- best.k(water.pca, c(2, 10), stand = TRUE, trace = 1)
clus.model<-pam(water.pca,k,stand=T)
clus.model$clusinfo
  size max_diss  av_diss diameter separation
1  210 30.12712 8.445552 42.29439   27.88689
2   33 12.74630 7.725452 21.91972   27.88689
water.md <- distance(water.par, "euclidean")
water.pco<-pco(water.md)
plot(water.pco$vectors[,1], water.pco$vectors[,2])
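The best.k helper is not shown in this post, so the following is only a
rough sketch of the selection step described above (try each k in 2:10,
keep the one with the largest average silhouette width). It runs on
synthetic stand-in data, not on water.par, and the names demo.par,
demo.dist, avg.sil and k.best are made up for illustration. One thing
it does differently on purpose: according to the pam documentation,
stand is ignored when pam is given a dissimilarity object, so the
sketch standardises the variables with scale before calling dist. Note
that scale uses the standard deviation, whereas pam's own stand option
uses the mean absolute deviation, so the two are not identical.

```r
library(cluster)  # pam() and its silhouette summary

# Synthetic stand-in for water.par (NOT the real data): two groups of
# sites measured on 8 numeric variables, so the sketch runs on its own.
set.seed(1)
demo.par <- rbind(matrix(rnorm(60 * 8, mean = 0), ncol = 8),
                  matrix(rnorm(20 * 8, mean = 4), ncol = 8))
colnames(demo.par) <- c("temp", "pH", "DO", "BOD", "COD",
                        "no3", "no2", "po4")

# Standardise the variables first, then compute Euclidean distances.
demo.dist <- dist(scale(demo.par))

# Hypothetical replacement for best.k(): average silhouette width
# of the pam solution for each candidate number of clusters.
ks <- 2:10
avg.sil <- sapply(ks, function(k) pam(demo.dist, k)$silinfo$avg.width)
names(avg.sil) <- ks

# Keep the k with the largest average silhouette width and refit.
k.best <- ks[which.max(avg.sil)]
clus.model <- pam(demo.dist, k.best)
clus.model$clusinfo
```

Printing avg.sil shows how much the silhouette width drops for the
other values of k, which may help judge how clear-cut the chosen
solution is.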
Thanks in advance and sorry for the verbosity level at max!!!