[R] Using pam, agnes or clara as prediction models?

Thu Jan 15 09:59:37 CET 2004

On Thu, 15 Jan 2004, Renald Buter wrote:

> On Thu, Jan 15, 2004 at 08:32:45AM +0000, Prof Brian Ripley wrote:
> > On Thu, 15 Jan 2004, Renald Buter wrote:
> > 
> > > On Wed, Jan 14, 2004 at 03:18:10PM -0500, Liaw, Andy wrote:
> > > > If pam produces the cluster medoids, you should be able to use the
> > > > 1-nearest-neighbor classifier for prediction of future data, using the
> > > > medoids as the `training' data.  1-NN is available in the `class' package,
> > > > part of the `VR' bundle.
> > > > 
> > > 
> > > Thanks very much for your quick answer! I've tried your suggestion in
> > > the following way:
> > > 
> > >  # separate the ruspini data into train and test set
> > >  > train<-ruspini[1:50,]
> > >  > test<-ruspini[51:75,]
> > >  > pamx<-pam(train,4)
> > >  > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
> > >  > knnx
> > >  [1] d d b b d c b c c d c a a d c c a a c a a d c d a
> > >  Levels: a b c d
> > > 
> > > But the result of applying the test set to the knn should only contain 2
> > > clusters, since the upper half of the ruspini data contains only 2
> > > clusters.
> > > 
> > > Could you tell me what I am missing here?
> > 
> > You asked that the upper half be divided into 4 clusters.  Did you look at 
> > the object pamx?  It contains 4 clusters covering only the first part of 
> > the dataset.
> 
> Yes, that was what I understood. My objective was to use this division
> by applying it to the test set: for each point in the test set, predict
> what cluster it would enter.
>
> > Given that when you apply pam to the whole dataset there is a cluster that
> > only occurs for objects 61:75, there is no way you can find that cluster
> > when no member of it is in your training set.
> 
> But isn't that what the knn does: locate the nearest neighbour of a point
> and assign its (nn) label to the point-to-be-classified?
> 
> I thought that I was doing:
>  1. create a clustering of data using PAM
>  2. train a knn with the medoids of the PAM clustering
>  3. apply the knn to the test set
>  4. look at the result
> 
> Could you tell me what I'm not getting here?

You created a clustering of the training set, yet interpreted it against
the clustering of the whole set using the now irrelevant statement

`the upper half of the ruspini data contains only 2 clusters'

which applies to the wrong clustering.  I pointed out that the training 
set does not contain a single member of one of _those_ clusters so you are 
bound to get a completely different clustering.
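
A minimal sketch of how one might check this, assuming the `cluster` package
(which provides both pam and the ruspini data) is installed: cross-tabulate
the 4-cluster solution for the whole dataset against membership in the first
50 rows. The names `full`, `in_train` and `tab` are illustrative only.

```r
library(cluster)   # pam() and the ruspini data
data(ruspini)

# Clustering of the WHOLE dataset (the one the "2 clusters" remark refers to)
full <- pam(ruspini, 4)

# TRUE for the rows that were used as the training set (1:50)
in_train <- seq_len(nrow(ruspini)) <= 50

# Rows: cluster label in the full clustering; columns: in training set or not
tab <- table(cluster = full$clustering, in_train = in_train)
print(tab)
```

Any cluster whose count in the TRUE column is zero has no member among the
training rows, so a clustering of those rows alone cannot recover it.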

When you divide a dataset into `training' and `testing' sets you are 
assuming at least exchangeability, whereas this dataset is clearly ordered.
So it is not credible that `train' and `test' are samples from the same 
population.
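
One possible way to act on this, sketched under assumptions: draw the
training rows at random rather than taking the first 50, then label each test
point by its nearest medoid. The seed, the split size of 50 and the variable
names are arbitrary choices for illustration. k = 1 follows the 1-NN
suggestion earlier in the thread; with k = 3 and only four training points,
one per class, the three neighbours always belong to three different classes,
so the vote is a tie that knn breaks at random.

```r
library(cluster)   # pam() and the ruspini data
library(class)     # knn()
data(ruspini)

set.seed(1)                          # arbitrary seed, for reproducibility only
idx   <- sample(nrow(ruspini), 50)   # random training rows instead of 1:50
train <- ruspini[idx, ]
test  <- ruspini[-idx, ]

# Cluster the training set; pamx$medoids has one row per cluster
pamx <- pam(train, 4)

# 1-nearest-neighbour: each test point gets the label of its closest medoid
pred <- knn(pamx$medoids, test, cl = factor(seq_len(4)), k = 1)
print(table(pred))
```

The predicted labels refer to the clusters of `train' only, so they should be
interpreted against pamx and not against a clustering of the whole dataset.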

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595