[R] slightly OT: (un)supervised clustering?

Tue Oct 28 21:21:40 CET 2008

On Tuesday 28 October 2008, viktoras didziulis wrote:
> Hi,
>
> my question is not exactly about R... What I am looking for are hints
> and directions on suitable methods (available in R or elsewhere)  to
> solve a grouping (or pattern recognition) problem of environmental
> features in an environmental gradient as described below.
>
> Given environmental sampling data set  (Depth, Presence of sand,
> Presence of boulders, Presence of clay).
> 1 1 1 0
> 1 1 0 0
> 1 1 1 0
> 2 1 1 0
> 3 1 1 0
> 3 1 1 0
> 4 1 1 0
> 5 1 0 0
> 5 1 0 0
> 5 1 1 0
> 5 1 0 0
> 6 1 0 0
> 6 1 0 0
> 6 1 1 0
> 7 1 0 1
> 7 1 0 0
> 8 1 0 1
> 9 1 1 1
> 9 1 0 1
> 9 1 0 1

Are these bore-hole logs? If so, check the literature in geophysics / earth 
science / soil science.

> Once I have sampling data ordered by depth, using my own "expert"
> opinion I can distinguish 3 groups A, B, C: A (1 - 4 m depth range) -
> where both sand and boulders are present, B (5 - 6 m range) - where sand
> is dominant with just a few observations of boulders, C (7 - 9 m range)
> - substrate dominated by sand and clay.

Hmm. I get something like that with a simple call to pam():

# need this
library(cluster)

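# had to make your data into something usable first; one way to do that
# (X1 = depth, X2 = sand, X3 = boulders, X4 = clay):
x <- data.frame(
  X1 = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  X2 = rep(1, 20),
  X3 = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  X4 = c(rep(0, 14), 1,0,1,1,1,1))

# partition into 4 groups
x.pam <- pam(x, k=4)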

# add the clustering vector back to your original data:
x$cluster <- x.pam$clustering

# looks like this
   X1 X2 X3 X4 cluster
1   1  1  1  0       1
2   1  1  0  0       1
3   1  1  1  0       1
4   2  1  1  0       1
5   3  1  1  0       1
6   3  1  1  0       1
7   4  1  1  0       2
8   5  1  0  0       2
9   5  1  0  0       2
10  5  1  1  0       2
11  5  1  0  0       2
12  6  1  0  0       3
13  6  1  0  0       3
14  6  1  1  0       3
15  7  1  0  1       3
16  7  1  0  0       3
17  8  1  0  1       4
18  9  1  1  1       4
19  9  1  0  1       4
20  9  1  0  1       4

Not sure if that is meaningful. If you are interested in the methods from the 
cluster package, be sure to get the book that it is based on (Kaufman and 
Rousseeuw, 1990, Finding Groups in Data).
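
One way to judge whether a partition like the one above is meaningful is to 
look at the medoids and at how the clusters line up with depth. This is only a 
sketch: it rebuilds the sample data under the column names X1..X4 (my own 
reconstruction of the data frame, X1 being depth) and reuses k=4 from above.

```r
library(cluster)

# sample data from the original post; X1 = depth, X2 = sand,
# X3 = boulders, X4 = clay (reconstructed, column names are assumed)
x <- data.frame(
  X1 = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  X2 = rep(1, 20),
  X3 = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  X4 = c(rep(0, 14), 1,0,1,1,1,1))

x.pam <- pam(x, k = 4)

# the representative observation (medoid) of each cluster
x.pam$medoids

# per-cluster size, maximum and average dissimilarity, diameter, separation
x.pam$clusinfo

# do the clusters follow the depth gradient?
depth.by.cluster <- table(depth = x$X1, cluster = x.pam$clustering)
depth.by.cluster
```

If each depth value falls mostly into a single cluster, the partition is at 
least consistent with the gradient.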

> Now the question - is there any formal method that can do the same e.g.
> separate the groups A, B and C by analyzing how does feature occurrence
> patterns change in samples along an environmental gradient (depth in
> this case)? Sample dataset here is simplified, in fact I have to deal
> with a dozen of features like salinity, exposure and related species
> lists. I "see" these groups as an expert, but it would be nice having a
> helper algorithm to see the groups for me, so I could describe it in
> Methods section of my writings :-)

This is a classic problem of variation in some property along some axis of 
anisotropy-- I tend to see this in my field as variation in soil properties 
with depth -aka- horizons.

> Similarity matrix and Cluster analysis or MDS do not perform as
> expected, because it groups stations from group A together with stations
> of other groups that have most similar substrate observations e.g. it
> ignores environmental gradient.

What happens if you include some indicator of the gradient in the 
unsupervised classification? See the example above, where I included 
depth.
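
To see the effect of the gradient variable, the same partitioning can be run 
with and without depth. A minimal sketch, assuming the sample data are held in 
a data frame with columns X1..X4 (X1 = depth) and that k=3 is used to mirror 
the expert groups A, B, C; both choices are mine, for illustration only.

```r
library(cluster)

# sample data from the original post; X1 = depth, X2 = sand,
# X3 = boulders, X4 = clay (reconstructed, column names are assumed)
x <- data.frame(
  X1 = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  X2 = rep(1, 20),
  X3 = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  X4 = c(rep(0, 14), 1,0,1,1,1,1))

# substrate variables only: depth is ignored
p.sub <- pam(x[, 2:4], k = 3)

# substrate variables plus depth
p.all <- pam(x, k = 3)

# cross-tabulate cluster membership against depth for both runs
table(depth = x$X1, substrate.only = p.sub$clustering)
table(depth = x$X1, with.depth = p.all$clustering)
```

Comparing the two tables shows how much of the grouping is driven by depth 
rather than by substrate alone.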

> Discriminant analysis expects me to do the grouping and then it will
> "decide" the rest. Therefore not suitable.
> A bunch of significance tests can help in deciding whether the
> differences are statistically significant. But again, I have to present
> my own groups, therefore - not suitable.
> Other unsupervised learning algorithms (Neural Networks & Co) - well,
> how can I instruct them to do analysis along an environmental gradient
> of depth ?..

If you have an idea of the number of groups you are looking for, then the 
pam() and clara() functions in the cluster package may do what you need. 
These are especially nice as they can deal with continuous, ordinal, and 
binary variables. If you do not know how many groups there may be, see the 
diana() function (divisive hierarchical clustering) and the daisy() function 
(dissimilarity matrices). With all of these, the 'stand=TRUE' argument will be 
important if your variables are on different scales.

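# an example using the data from above:
# standardized dissimilarities -> divisive clustering -> hclust object
x.hc <- as.hclust(diana(daisy(x[,1:4], stand=TRUE)))
# label each leaf with its pam() cluster number
x.hc$labels <- x$cluster
plot(x.hc)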
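
When the number of groups is unknown, one common heuristic is to run pam() 
over a range of k and compare the average silhouette width. The sketch below 
is an illustration, not a recommendation: the data frame is my reconstruction 
of the sample data, the range 2:6 is arbitrary, and the constant sand column 
(X2) is dropped because it carries no information. daisy() may warn that the 
0/1 columns are being treated as interval scaled.

```r
library(cluster)

# sample data from the original post; X1 = depth, X2 = sand,
# X3 = boulders, X4 = clay (reconstructed, column names are assumed)
x <- data.frame(
  X1 = c(1,1,1,2,3,3,4,5,5,5,5,6,6,6,7,7,8,9,9,9),
  X2 = rep(1, 20),
  X3 = c(1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,0,1,0,0),
  X4 = c(rep(0, 14), 1,0,1,1,1,1))

# standardized dissimilarities, leaving out the constant column X2
d <- daisy(x[, c("X1", "X3", "X4")], stand = TRUE)

# average silhouette width for each candidate number of groups
ks <- 2:6
avg.sil <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)
data.frame(k = ks, avg.sil = round(avg.sil, 2))

# the k with the widest average silhouette
best.k <- ks[which.max(avg.sil)]
best.k
```

The suggested k should still be checked against what you know about the 
gradient.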

> If anyone among the experts on this list has dealt with similar problems
> before I would highly appreciate if you could briefly describe your
> approaches or point to the right sources.
>
> And in general I am interested in approaches of locating discontinuities
> in data patterns sampled along environmental gradients.

The soil science literature may have some relevant insight into this matter.

Good luck,

Dylan

> Best wishes!
> Viktoras Didziulis
> P.S. just subscribed to this list, sorry if I'm missing something

-- 
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341