[R] K-means recluster data with given cluster centers
t.peter.Mueller at gmx.net
Mon Jan 11 13:19:32 CET 2010
Dear R users,
I have several large data sets, and additional data sets will be created over time.
I want to cluster all of the data in an identical way with the k-means algorithm.
With the first data set I find my cluster centers and save them to a file [1].
This first data set is huge, so the cluster centers are guaranteed to converge.
Afterwards I load the saved cluster centers and run k-means on all other data sets with those same centers [2].
I tried this, but the reclustering step now fails with the following error message:
"Error: empty cluster: try a better set of initial centers"
An empty cluster (one with no data points) should not be a problem; it can happen because the new data sets can be smaller.
What am I doing wrong? Is there another way to cluster new data in the same way as the old data sets?
Thanks
Peter
1: R code to find the cluster centers and save them to a file
#---INITIAL CLUSTERING TO FIND CLUSTER CENTERS
# kmeans() is in the stats package, which R attaches by default,
# so library(cluster) is not needed here
# LOAD DATA
data_unclean <- read.table("dataset1.dat")
data.matrix <- as.matrix(data_unclean)
# CLUSTER
Nclust <- 100 # number of cluster centers
Imax <- 200 # maximum number of k-means iterations
set.seed(100) # seed the random number generator for reproducibility
init <- sample(nrow(data.matrix), Nclust) # row indices of the Nclust initial prototypes
km <- kmeans(data.matrix, centers=data.matrix[init,], iter.max=Imax)
# WRITE OUT CLUSTER CENTERS
km$centers # print cluster centers (columns: dimensions; rows: clusters)
km$size # print number of data points in each cluster
clusterCenters <- km$centers
save(file="clusterCenters.RData", list='clusterCenters') # example
write.table(km$centers, file = "clusterCenters.dat", sep = ",", col.names= FALSE, row.names= FALSE)
2: R code to recluster new data
#---RECLUSTER NEW DATA WITH GIVEN CLUSTER CENTERS
# SET PARAMETERS
loopStart <- 0 # index of the first data set
loopEnd <- 10 # index of the last data set
# LOAD CLUSTER CENTERS
load("clusterCenters.RData") # restores the object 'clusterCenters'
# LOOP OVER TRAJ AND RECLUSTER THEM
for (ii in loopStart:loopEnd) {
  # DEFINE FILENAMES (with the "." before "dat", matching "dataset1.dat" in [1])
  filenameInput <- paste("dataset", ii, ".dat", sep="")
  filenameOutput <- paste("dataset", ii, ".datClusters", sep="")
  print(filenameInput)
  print(filenameOutput)
  # LOAD DATA
  data_unclean <- read.table(filenameInput)
  data.matrix <- as.matrix(data_unclean)
  # RECLUSTER DATA (this is the call that raises the "empty cluster" error)
  kmRecluster <- kmeans(data.matrix, centers=clusterCenters, iter.max=1)
  print(kmRecluster$size) # explicit print(): values are not auto-printed inside a loop
  # WRITE OUT CLUSTERS FOR EACH DATA SET
  write.table(kmRecluster$cluster, file = filenameOutput, sep = ",", col.names= FALSE, row.names= FALSE)
}
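One possible way around the error, since the centers are meant to stay fixed, is to skip kmeans() for the new data and simply assign each row to its nearest saved center. The following is only a minimal sketch under the assumption that plain Euclidean distance is what is wanted; assign_to_centers is a hypothetical helper name (not part of R or any package), and the small matrices stand in for clusterCenters and a new data set.

```r
# Hypothetical helper: label each row of x with the index of its nearest
# row in centers, using squared Euclidean distance. Centers are not updated.
assign_to_centers <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  # ||x_i - c_j||^2 = ||x_i||^2 + ||c_j||^2 - 2 * x_i . c_j
  d2 <- outer(rowSums(x^2), rowSums(centers^2), "+") - 2 * tcrossprod(x, centers)
  unname(apply(d2, 1, which.min))
}

# Toy stand-ins (assumed values, for illustration only)
centers <- rbind(c(0, 0), c(10, 10), c(100, 100))
newData <- rbind(c(1, 1), c(9, 11), c(0, -1))

cl <- assign_to_centers(newData, centers)
# Cluster sizes, with a 0 kept for any center that receives no points
sizes <- tabulate(cl, nbins = nrow(centers))
print(cl)
print(sizes)
```

In the loop from [2], the kmeans() call could then be replaced by assign_to_centers(data.matrix, clusterCenters), and the resulting vector written out in place of kmRecluster$cluster. An empty cluster just shows up as a zero in the sizes and does not stop the script.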
More information about the R-help mailing list