[R] Remove redundant observations for cross-validation

Eleni Rapsomaniki e.rapsomaniki at mail.cryst.bbk.ac.uk
Fri Aug 10 14:03:49 CEST 2007


This is a general statistics question that I believe comes up often, so there
may be R functions or packages dedicated to it.
Suppose you want to check the accuracy of a classifier using a large training
data set where each row represents an observation. Is there a simple approach
for removing redundant rows (rows with very similar values in all columns)
from the training data, so as to obtain a realistic estimate of classification
performance under cross-validation? The only one I can think of is clustering
the data into an arbitrary number of clusters and selecting one observation
from each cluster.

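library(cluster)  # provides pam()

# Toy data: four groups of differing spread; each cbind() needs equal-length columns
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
           cbind(rnorm(15,5,2.5), rnorm(15,5,2.5)),
           cbind(rnorm(15,15,0.5), rnorm(15,15,0.5)),
           cbind(rnorm(5,5,0.1), rnorm(5,5,0.1)))
pamx <- pam(x, 15)

# Pick one observation at random from each cluster
y <- array(NA, dim=c(15,ncol(x)))
for(i in 1:15){
        idx <- which(pamx$clustering==i)
        # index via sample.int() so a one-member cluster is handled correctly
        y[i,] <- x[idx[sample.int(length(idx), 1)],]
}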
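
A deterministic variant of the same idea, as a rough sketch only: pam() already
records which row is the medoid of each cluster (the id.med component), so the
medoids themselves could serve as the representatives instead of a random draw.
The toy data, the seed and the value of k below are assumptions made purely for
illustration.

```r
library(cluster)  # provides pam()

set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical toy data: two groups of near-duplicate rows
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(5, 5, 0.1), rnorm(5, 5, 0.1)))

k <- 4  # number of clusters; still an arbitrary choice
pamx <- pam(x, k)

# id.med holds the row index of each cluster's medoid,
# so this keeps exactly one actual observation per cluster
y <- x[pamx$id.med, , drop = FALSE]
```

This removes the randomness of the selection, but not the arbitrariness of k.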

This seems a bit subjective though... Any better ideas?
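
One related sketch, assuming "redundant" is taken to mean "within some Euclidean
distance tol of each other": complete-linkage hierarchical clustering cut at
height tol yields groups whose members are all within tol of one another, and
the first row of each group can be kept. The threshold tol and the toy data are
hypothetical; this only trades the arbitrary number of clusters for an
arbitrary distance.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical toy data: two groups of near-duplicate rows
x <- rbind(cbind(rnorm(10, 0, 0.5), rnorm(10, 0, 0.5)),
           cbind(rnorm(5, 5, 0.1), rnorm(5, 5, 0.1)))

tol <- 0.5  # hypothetical threshold below which rows count as redundant

# Complete linkage merges at the maximum pairwise distance, so cutting the
# tree at tol gives groups in which every pair of rows is within tol
hc  <- hclust(dist(x), method = "complete")
grp <- cutree(hc, h = tol)

# Keep the first row encountered in each group
y <- x[!duplicated(grp), , drop = FALSE]
```

In practice the columns would probably need scaling first (e.g. with scale())
so that no single variable dominates the distance.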

Eleni Rapsomaniki
