[R-sig-eco] [R] Component analysis / cluster analysis of multiple sites based on soil characteristics

Mon Jan 23 15:27:22 CET 2012

Sacha,

    I do not fully understand your objectives, but there are several 
things to bear in mind in your approach below.   You refer to your 
result object as water.pca, but it's simply a distance matrix, not a 
PCA.  More problematic, perhaps, is that it's calculated on a matrix 
whose columns are on very different scales, e.g. temp is around 30 and 
no2 is < 0.01.  In calculating Euclidean distance (the default for 
dist()) these scales matter a lot: the columns with the largest numeric 
range dominate the result.  If what you want is truly a clustering of 
sites based on these attributes, you should standardize the columns 
before calculating dist().
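
    To make the scale issue concrete, here is a minimal sketch using a 
small made-up data frame (the values and the names toy.par and share() 
are purely illustrative assumptions, not your data).  scale() centers 
each column and divides by its standard deviation, so that temp and no2 
contribute comparably to the Euclidean distance.

```r
# Hypothetical toy data: two columns on very different scales
toy.par <- data.frame(
  temp = c(33.5, 31.2, 29.1, 30.4, 32.8),
  no2  = c(0.008, 0.002, 0.006, 0.001, 0.009)
)

# Raw Euclidean distance: driven almost entirely by temp
d.raw <- dist(toy.par)

# Standardize each column to mean 0 and sd 1, then compute distances
toy.std <- scale(toy.par)
d.std <- dist(toy.std)

# Illustrative helper: each column's share of the total squared
# pairwise distance (summed over all pairs of rows)
share <- function(x) {
  ss <- colSums(apply(x, 2, function(v) as.vector(dist(v))^2))
  ss / sum(ss)
}

raw.share <- share(as.matrix(toy.par))
std.share <- share(toy.std)

print(round(raw.share, 4))  # temp accounts for nearly everything
print(round(std.share, 4))  # equal shares after standardization
```

Note also that the stand argument of pam() only applies when pam() is 
given the raw data; it is ignored when you pass a dissimilarity object, 
so the standardization has to happen before dist().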

    Once you have a distance matrix from the standardized data you could 
use pam or agnes (as you have already done) but might also want to see 
an ordination.  Given a Euclidean distance matrix I would recommend 
Principal Coordinates Analysis (PCO or PCoA depending on source) which I 
believe is available in the ecodist package you already have loaded.
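
    A rough sketch of that workflow, under some assumptions: the data 
are simulated (two artificial groups in an object I call sim.par), and 
I use cmdscale() from base R for the principal coordinates step only 
because it needs no extra packages.  pco() in ecodist is the route 
mentioned above and should serve the same purpose.  The choice of k by 
maximum average silhouette width mirrors what you describe doing.

```r
library(cluster)  # pam(); a recommended package shipped with R

set.seed(1)
# Hypothetical data: two groups, three variables on different scales
sim.par <- rbind(
  cbind(rnorm(30, 30, 1), rnorm(30, 7.4, 0.1), rnorm(30, 0.002, 0.001)),
  cbind(rnorm(10, 36, 1), rnorm(10, 8.0, 0.1), rnorm(10, 0.010, 0.001))
)
colnames(sim.par) <- c("temp", "pH", "no2")

# Standardize the columns first, then compute Euclidean distances
d <- dist(scale(sim.par))

# Try several k and keep the one with the largest average silhouette width
ks <- 2:6
avg.sil <- sapply(ks, function(k) pam(d, k)$silinfo$avg.width)
k.best <- ks[which.max(avg.sil)]
fit <- pam(d, k.best)

# Principal coordinates analysis on the same distance matrix
pcoa <- cmdscale(d, k = 2, eig = TRUE)

# Plot the first two axes, colored by pam cluster membership
plot(pcoa$points[, 1], pcoa$points[, 2], col = fit$clustering,
     xlab = "PCoA 1", ylab = "PCoA 2")
```

Plotting the cluster memberships on the ordination axes is a quick 
visual check of whether the groups found by pam are separated in the 
ordination as well.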

Dave Roberts

On 01/23/2012 05:51 AM, Sacha Viquerat wrote:
> Hello dear list!
> Maybe I am demanding too much, but I am having problems finding the
> right way to tackle a seemingly trivial problem:
>
> We counted fish at different sites. In order to assess habitat quality
> at each site, we sampled temperature, pH etc. at each site, resulting in
> 243 observations of 8 independent variables. As we would like to
> identify clusters within this data set, we stumbled upon three
> approaches: two as realized in package cluster, using dist to create a
> distance matrix from our numeric variables and then pam to produce a
> model or agnes and then various tree methods to simplify the tree, as
> well as an approach via the ecodist package (using distance and pco).
> While results obtained through the cluster package were the same
> (phew!), the result from the ecodist approach did not identify clusters
> at all. As we are all confused and I am the one in charge of deciding
> which way to go, and as I am the one most confused after all, I am
> completely lost. Doing such an analysis for the first time, I would be
> satisfied with the pam approach identifying 2 clusters (via iterating
> over each k in 2:10 and picking the max average silhouette of each
> model). However, as there are so many different approaches out there, I
> am not sure if all the assumptions are met! It seems for example that pH
> is more or less randomly distributed. Should we keep such variables? How
> can I access the actual loadings of the principal axis of the pam model?
> Couldn't find that anywhere! In the end, there are only 33 observations
> in the 2nd group, which will be making the further analysis of fish
> counts heavily unbalanced. Any suggestions?
>
> Code snippet:
>
> water.par
>
>     temp  pH   DO   BOD  COD   no3   no2  po4
> 1   33.5 7.4 5.30  4.04 15.0 0.120 0.008 0.20
> 2   33.5 7.4 5.30  4.04 15.0 0.120 0.008 0.20
> .
> .
> .
> 243 29.1 7.4 6.80 12.56 45.0 0.740 0.002 0.32
>
> water.pca<-dist(as.matrix(water.par))
> k=best.k(water.pca,c(2,10),stand=T,trace=1) # finding the k with highest average silhouette dist
> clus.model<-pam(water.pca,k,stand=T)
>
> clus.model$clusinfo
>
>   size max_diss  av_diss diameter separation
> 1  210 30.12712 8.445552 42.29439   27.88689
> 2   33 12.74630 7.725452 21.91972   27.88689
>
> water.md <- distance(water.par, "euclidean")
> water.pco<-pco(water.md)
> plot(water.pco$vectors[,1], water.pco$vectors[,2])
>
> Thanks in advance and sorry for the verbosity level at max!!!