[R-sig-eco] CoDA: Clustering Multiple Data Sets

Fri Oct 10 15:08:44 CEST 2014

Hi Rich,

It is not clear whether you need a supervised or an unsupervised model. Clustering is unsupervised: it groups compositions hierarchically regardless of their labels (countries, regions). If this is what you intend, you might compute the clustering (hclust) on a Euclidean distance matrix (vegdist) computed on the clr- or ilr-transformed data (both transformations return the same distances).

If you mean a supervised approach, you might want to explain how groups differ, and/or predict to which group a composition belongs. To explain, discriminant analysis (packages MASS or ade4) is (arguably) often a good choice. To predict a category, you might look at machine learning techniques (see the caret package, among many others).
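
As a rough illustration of the unsupervised route, here is a minimal base-R sketch. It is only a sketch under stated assumptions: the data are simulated (6 hypothetical stream networks, 5 years each, 4 made-up feeding groups), the clr transform is written out by hand rather than taken from compositions, the ilr basis is one arbitrary choice (a normalised Helmert basis), and dist() from stats stands in for vegdist(), which should give the same Euclidean distances. The average linkage and the cut at 3 groups are arbitrary too.

```r
# Simulated data (hypothetical): 6 stream networks, 5 years each, 4 parts.
set.seed(1)
n_streams <- 6; n_years <- 5; n_parts <- 4
stream <- factor(rep(paste0("S", 1:n_streams), each = n_years))
counts <- matrix(rgamma(n_streams * n_years * n_parts, shape = 5),
                 ncol = n_parts,
                 dimnames = list(NULL, c("shredder", "collector",
                                         "scraper", "predator")))
comp <- counts / rowSums(counts)   # close each row so it sums to 1

# Centred log-ratio (clr) transform; assumes no zero parts (log(0) is -Inf).
clr <- function(x) {
  lx <- log(x)
  lx - rowMeans(lx)
}
z_clr <- clr(comp)

# One possible ilr transform: project the clr scores onto an orthonormal
# basis of the zero-sum subspace (normalised Helmert contrasts).
H <- contr.helmert(n_parts)
V <- sweep(H, 2, sqrt(colSums(H^2)), "/")
z_ilr <- z_clr %*% V

# Euclidean distances on clr and on ilr scores should agree.
d_clr <- dist(z_clr)
d_ilr <- dist(z_ilr)
same_distances <- isTRUE(all.equal(as.vector(d_clr), as.vector(d_ilr)))

# Hierarchical clustering on the distance matrix, cut into 3 groups.
hc <- hclust(d_clr, method = "average")
groups <- cutree(hc, k = 3)

# Cross-tabulate the clusters against the labels they ignored.
table(stream, groups)
```

The final table shows how the unsupervised groups line up with the stream labels; plot(hc) would draw the dendrogram.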

Regards,

Essi

From: Rich Shepard
Sent: Thursday, 9 October 2014 15:13
To: <r-sig-ecology at r-project.org>

   The documentation for packages compositions and robCompositions describes
distance measures and (in the former package) clustering. However, all the
examples, and the function syntax, apply to a single data set.

   This works well with geochemical and official statistical data when the
goal is to examine relationships among the components in the data set. I
find no examples for clustering multiple compositional data sets. An example
would be if the expenditures (or expendituresEU) data sets in robCompositions
included data from multiple countries and the analytical goal were to cluster
the countries based on each one's compositional data set. The data set
AnimalVegetation in the compositions package compares "[A]real compositions
by abundance of vegetation and animals for 50 plots in each of regions A and
B" and appears to be similar to my data: macroinvertebrate compositions by
functional feeding groups for a variable number of years in
each of 6 stream networks; each stream network is a separate data set. I
want to cluster the streams based on each data set. Unfortunately, I do not
see an example in package compositions that uses the AnimalVegetation data
for clustering.

   The hclust() function in the stats and compositions packages (perhaps the
latter calls the function in the former package) appears to be limited to a
single data set.

   What package and function will allow me to calculate a distance matrix for
these 6 compositional data sets, then use those distances for hierarchical
clustering?

Rich

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology