[R] Using pam, agnes or clara as prediction models?
Prof Brian Ripley
ripley at stats.ox.ac.uk
Thu Jan 15 09:59:37 CET 2004
On Thu, 15 Jan 2004, Renald Buter wrote:
> On Thu, Jan 15, 2004 at 08:32:45AM +0000, Prof Brian Ripley wrote:
> > On Thu, 15 Jan 2004, Renald Buter wrote:
> >
> > > On Wed, Jan 14, 2004 at 03:18:10PM -0500, Liaw, Andy wrote:
> > > > If pam produces the cluster medoids, you should be able to use the
> > > > 1-nearest-neighbor classifier for prediction of future data, using the
> > > > medoids as the `training' data. 1-NN is available in the `class' package,
> > > > part of the `VR' bundle.
> > > >
> > >
> > > Thanks very much for your quick answer! I've tried your suggestion in
> > > the following way:
> > >
> > > # separate the ruspini data into train and test set
> > > > train<-ruspini[1:50,]
> > > > test<-ruspini[51:75,]
> > > > pamx<-pam(train,4)
> > > > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
> > > > knnx
> > > [1] d d b b d c b c c d c a a d c c a a c a a d c d a
> > > Levels: a b c d
> > >
> > > But the result of applying the test set to the knn should only contain 2
> > > clusters, since the upper half of the ruspini data contains only 2
> > > clusters.
> > >
> > > Could you tell me what I am missing here?
> >
> > You asked that the upper half be divided into 4 clusters. Did you look at
> > the object pamx? It contains 4 clusters covering only the first part of
> > the dataset.
>
> Yes, that was what I understood. My objective was to use this division
> by applying it to the test set: for each point in the test set, predict
> what cluster it would enter.
>
> > Given that when you apply pam to the whole dataset there is a cluster that
> > only occurs for objects 61:75, there is no way you can find that cluster
> > when no member of it is in your training set.
>
> But isn't that what the knn does: locate the nearest neighbour of a point
> and assign its (nn) label to the point-to-be-classified?
>
> I thought that I was doing:
> 1. create a clustering of data using PAM
> 2. train a knn with the medoids of the PAM clustering
> 3. apply the knn to the test set
> 4. look at the result
>
> Could you tell me what I'm not getting here?
You created a clustering of the training set, yet interpreted it against
the clustering of the whole set using the now irrelevant statement
`the upper half of the ruspini data contains only 2 clusters'
which applies to the wrong clustering. I pointed out that the training
set does not contain a single member of one of _those_ clusters so you are
bound to get a completely different clustering.
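
To check this on your own machine, the following minimal sketch clusters the
whole dataset and cross-tabulates the cluster labels against the 1:50 / 51:75
split used above. The names `full` and `half` are illustrative only and do not
come from the thread.

```r
library(cluster)  # provides pam() and the ruspini data

data(ruspini)

# Cluster all 75 points, not just the first 50
full <- pam(ruspini, 4)

# Label each row by the half of the ordered dataset it falls in
half <- ifelse(seq_len(nrow(ruspini)) <= 50, "train (1:50)", "test (51:75)")

# Rows: clusters of the whole dataset; columns: which half the objects are in
table(cluster = full$clustering, half)
```

A cluster whose count in the train column is zero cannot be recovered from a
clustering of rows 1:50 alone.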
When you divide a dataset into `training' and `testing' sets you are
assuming at least exchangeability, whereas this dataset is clearly ordered.
So it is not credible that `train' and `test' are samples from the same
population.
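
One way to make the split more defensible is to draw the training rows at
random instead of taking the first 50. The sketch below does that and then
uses 1-NN on the medoids, as Andy Liaw suggested earlier in the thread. The
seed, the training size of 50 and the names `idx` and `pred` are arbitrary
choices for illustration; `knn1()` is the 1-nearest-neighbour function in the
`class' package.

```r
library(cluster)  # pam(), ruspini
library(class)    # knn1()

data(ruspini)
set.seed(1)  # arbitrary seed, only so the split is reproducible

# Random split, so that train and test are plausibly exchangeable
idx   <- sample(nrow(ruspini), 50)
train <- ruspini[idx, ]
test  <- ruspini[-idx, ]

# Cluster the training set only
pamx <- pam(train, 4)

# 1-NN against the medoids: each test point receives the label
# of the medoid nearest to it
pred <- knn1(pamx$medoids, test, cl = factor(seq_len(nrow(pamx$medoids))))
table(pred)
```

The predicted labels still refer to the clustering of the training set, so
they should be interpreted against `pamx' and not against a clustering of the
whole dataset.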
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595