[R] Remove redundant observations for cross-validation

Fri Aug 10 14:03:49 CEST 2007

Hi,

This is a general statistics question that I believe comes up often, so there
may be R functions/packages dedicated to it.
Suppose you want to check the accuracy of a classifier using a large training
data set where each row represents an observation. Is there a simple approach
for removing redundant rows (rows with very similar values for all columns)
from the training data so as to obtain a realistic estimate of classification
performance under cross-validation? The only one I can think of is clustering
the data into an arbitrary number of clusters and selecting one observation
from each cluster.

e.g.
library(cluster)
# each pair of columns must have the same length, otherwise cbind() recycles
x <- rbind(cbind(rnorm(10,0,0.5), rnorm(10,0,0.5)),
           cbind(rnorm(15,5,2.5), rnorm(15,5,2.5)),
           cbind(rnorm(15,15,0.5), rnorm(15,15,0.5)),
           cbind(rnorm(5,5,0.1), rnorm(5,5,0.1)))

pamx <- pam(x, 15)

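# pick one observation at random from each cluster
y <- array(NA, dim=c(15,ncol(x)))
for(i in 1:15){
        idx <- which(pamx$clustering==i)
        # index into idx: sample(n, 1) with a single number n draws from 1:n,
        # which would pick the wrong row for a one-member cluster
        y[i,] <- x[idx[sample.int(length(idx), 1)],]
}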
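
A possible variant of the same idea, as a rough sketch only: instead of drawing
a random member from each cluster, keep the medoid that pam() already reports
for each cluster (its `id.med` component holds the row indices). The seed, the
names `k`, `keep` and `y_med`, and the choice of 15 clusters are assumptions
for illustration; the choice of k is just as arbitrary as before.

```r
library(cluster)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible

# Same kind of toy data as above, with matching column lengths
x <- rbind(cbind(rnorm(10, 0, 0.5),  rnorm(10, 0, 0.5)),
           cbind(rnorm(15, 5, 2.5),  rnorm(15, 5, 2.5)),
           cbind(rnorm(15, 15, 0.5), rnorm(15, 15, 0.5)),
           cbind(rnorm(5, 5, 0.1),   rnorm(5, 5, 0.1)))

k <- 15             # assumption: number of clusters, chosen arbitrarily
pamx <- pam(x, k)

# pam() stores the row index of each cluster's medoid in id.med
keep  <- pamx$id.med
y_med <- x[keep, , drop = FALSE]   # one representative row per cluster
```

This removes the randomness of the draw, but not the subjectivity of choosing k.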

This seems a bit subjective though... Any better ideas?

Eleni Rapsomaniki