[R] Using pam, agnes or clara as prediction models?

Thu Jan 15 12:22:18 CET 2004

On Thu, Jan 15, 2004 at 08:59:37AM +0000, Prof Brian Ripley wrote:
[snip]
> > > >  # separate the ruspini data into train and test set
> > > >  > train<-ruspini[1:50,]
> > > >  > test<-ruspini[51:75,]
> > > >  > pamx<-pam(train,4)
> > > >  > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=3)
> > > >  > knnx
> > > >  [1] d d b b d c b c c d c a a d c c a a c a a d c d a
> > > >  Levels: a b c d
> > > > 
> > > > But the result of applying the test set to the knn should only contain 2
> > > > clusters, since the upper half of the ruspini data contains only 2
> > > > clusters.
> > > > 
> > > > Could you tell me what I am missing here?
[snip]
> When you divided a dataset into `training' and `testing' sets you are 
> assuming at least exchangeability whereas this dataset is clearly ordered.
> So it is not credible that `train' and `test' are samples from the same 
> population.
> 

Thank you *very* much for your help. I thought I'd let the list know
what I did to get it right:

 # create a random permutation of the row indices 1..75
 > seed<-rank(runif(75))
 > train<-ruspini[seed[1:60],]
 > test<-ruspini[seed[61:75],]
 > pamx<-pam(train,4)
 > knnx<-knn(pamx$medoids,test,factor(c("a","b","c","d")),k=1)

And now the result makes sense!
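
For anyone finding this in the archive, the same idea can be sketched as a self-contained script. This is only an illustration under a few assumptions of mine: set.seed(1) is an arbitrary seed added for reproducibility, sample() replaces rank(runif(75)) as another way to get a random permutation, and the medoids are labelled 1 to 4 instead of "a" to "d".

```r
library(cluster)  # provides pam() and the ruspini data
library(class)    # provides knn()

set.seed(1)                      # arbitrary seed (assumption), for reproducibility only
data(ruspini)

# Shuffle the row indices so train and test are drawn from the whole data set
idx   <- sample(nrow(ruspini))   # random permutation of 1..75
train <- ruspini[idx[1:60], ]
test  <- ruspini[idx[61:75], ]

# Cluster the training rows around 4 medoids
pamx <- pam(train, 4)

# Use the 4 medoids as a tiny labelled training set; with k = 1 each
# test point gets the label of its nearest medoid
knnx <- knn(pamx$medoids, test, cl = factor(1:4), k = 1)

table(knnx)                      # number of test points assigned to each cluster
```

Note that k=1 matters here as much as the shuffling does. There are only four medoids, each with its own label, so with k=3 the three nearest neighbours always carry three different labels and knn() has to break the tie at random.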

Thanks again,

Renald