kmeans {stats}  R Documentation 
Perform k-means clustering on a data matrix.
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy",
                     "MacQueen"), trace = FALSE)
## S3 method for class 'kmeans'
fitted(object, method = c("centers", "classes"), ...)
x 
numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns). 
centers 
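either the number of clusters, say k, or a set of initial (distinct) cluster centres. If a number, a random set of (distinct) rows in x is chosen as the initial centres. 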
iter.max 
the maximum number of iterations allowed. 
nstart 
if centers is a number, how many random sets should be chosen? 
algorithm 
character: may be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm. 

object 
an R object of class "kmeans", typically the result of a call to kmeans. 
method 
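character: may be abbreviated. "centers" causes fitted to return cluster centers (one for each input point) and "classes" causes fitted to return a vector of class assignments. 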
trace 
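logical or integer number, currently only used in the
default method ("Hartigan-Wong"): if positive (or true), tracing
information on the progress of the algorithm is produced. 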
... 
not used. 
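
The two values of the method argument of fitted can be compared side by side. The following is a minimal sketch on simulated data; the seed and the object names fc and fk are arbitrary choices made for illustration only.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# "centers": one row per observation, holding the centre of its cluster
fc <- fitted(cl, method = "centers")
dim(fc)

# "classes": the cluster label of each observation
fk <- fitted(cl, method = "classes")
table(fk)

# The "centers" result is the centre matrix indexed by the cluster vector
all.equal(unname(fc), unname(cl$centers[cl$cluster, , drop = FALSE]))
```

Because the "centers" result has the same shape as the data, x - fitted(cl) gives per-observation residuals directly.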
The data given by x are clustered by the k-means method, which aims
to partition the points into k groups such that the sum of squares
from points to the assigned cluster centres is minimized.
At the minimum, all cluster centres are at the mean of their Voronoi
sets (the set of data points which are nearest to the cluster centre).
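
The statement that each centre sits at the mean of its assigned points can be checked numerically. This is a small sketch on simulated data; the seed and the name manual_centers are illustrative assumptions, not part of the kmeans interface.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# Column sums per cluster divided by cluster size give the per-cluster means.
# rowsum() orders its rows by sorted group label, matching cl$size.
manual_centers <- rowsum(x, cl$cluster) / cl$size

# Expected to agree with the reported centres up to floating-point rounding
all.equal(unname(manual_centers), unname(cl$centers))
```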
The algorithm of Hartigan and Wong (1979) is used by default. Note
that some authors use k-means to refer to a specific algorithm
rather than the general method: most commonly the algorithm given by
MacQueen (1967) but sometimes that given by Lloyd (1957) and Forgy
(1965). The Hartigan–Wong algorithm generally does a better job than
either of those, but trying several random starts (nstart > 1) is
often recommended. In rare cases, when some of the points
(rows of x) are extremely close, the algorithm may not converge
in the “Quick-Transfer” stage, signalling a warning (and
returning ifault = 4). Slight
rounding of the data may be advisable in that case.
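
To see why several random starts are recommended, one can run a number of single-start fits and look at the spread of the objective. The sketch below does this by hand and then uses nstart for comparison; the seed, the choice of 25 starts, and the names fits, obj, best and cl25 are illustrative assumptions.

```r
set.seed(123)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# 25 independent single-start fits with deliberately too many clusters
fits <- lapply(1:25, function(i) kmeans(x, 5, nstart = 1))
obj  <- vapply(fits, function(f) f$tot.withinss, numeric(1))

range(obj)                      # spread of the local optima that were found
best <- fits[[which.min(obj)]]  # keep the fit with the smallest objective

# Letting kmeans try 25 random starts itself and keep the best one
cl25 <- kmeans(x, 5, nstart = 25)
cl25$tot.withinss
```

A wide range in obj suggests that a single start can land in a poor local optimum for these data.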
For ease of programmatic exploration, k = 1 is allowed, notably
returning the center and withinss.
Except for the Lloyd–Forgy method, k clusters will always be
returned if a number is specified.
If an initial matrix of centres is supplied, it is possible that
no point will be closest to one or more centres, which is currently
an error for the Hartigan–Wong method.
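
Supplying a matrix of initial centres might look like the sketch below. Picking rows 1 and 51 of the data and the deliberately remote centre at (100, 100) are hypothetical choices for illustration; the exact wording of the error message is not assumed, only that the default method signals an error as described above.

```r
set.seed(7)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Initial centres taken from the data, one row from each simulated group
init <- x[c(1, 51), , drop = FALSE]
cl <- kmeans(x, centers = init)
cl$size

# A centre far from every point attracts no observations; with the default
# Hartigan-Wong method this is an error, so guard the call.
far <- rbind(c(0, 0), c(100, 100))
res <- tryCatch(kmeans(x, centers = far),
                error = function(e) conditionMessage(e))
res
```

Taking initial centres from distinct rows of the data is one simple way to make an empty initial cluster unlikely.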
kmeans returns an object of class "kmeans" which has a
print and a fitted method. It is a list with at least
the following components:
cluster 
A vector of integers (from 1:k) indicating the cluster to which each point is allocated. 
centers 
A matrix of cluster centres. 
totss 
The total sum of squares. 
withinss 
Vector of within-cluster sum of squares, one component per cluster. 
tot.withinss 
Total within-cluster sum of squares,
i.e. sum(withinss). 
betweenss 
The between-cluster sum of squares,
i.e. totss - tot.withinss. 
size 
The number of points in each cluster. 
iter 
The number of (outer) iterations. 
ifault 
integer: indicator of a possible algorithm problem – for experts. 
The clusters are numbered in the returned object, but they are a set and no ordering is implied. (Their apparent ordering may differ by platform.)
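
Since the cluster numbers carry no meaning, code that compares fits across runs or platforms may want to relabel them in some canonical way. The sketch below orders clusters by the first coordinate of their centres; this ordering rule and the names ord, relabel and centers_sorted are assumptions made for illustration, not something kmeans provides.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, 2)

# ord[j] is the original label of the centre with the j-th smallest
# first coordinate
ord <- order(cl$centers[, 1])

# New label of a point = position of its original label within ord
relabel <- match(cl$cluster, ord)
centers_sorted <- cl$centers[ord, , drop = FALSE]

# Each point still maps to the same centre after relabelling
all.equal(unname(centers_sorted[relabel, , drop = FALSE]),
          unname(cl$centers[cl$cluster, , drop = FALSE]))
```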
Forgy, E. W. (1965). Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics, 21, 768–769.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100–108. doi:10.2307/2346830.
Lloyd, S. P. (1957, 1982). Least squares quantization in PCM. Technical Note, Bell Laboratories. Published in 1982 in IEEE Transactions on Information Theory, 28, 128–137.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, eds L. M. Le Cam & J. Neyman, 1, pp. 281–297. Berkeley, CA: University of California Press.
require(graphics)
# a 2-dimensional example
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
(cl <- kmeans(x, 2))
plot(x, col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
# sum of squares
ss <- function(x) sum(scale(x, scale = FALSE)^2)
## cluster centers "fitted" to each obs.:
fitted.x <- fitted(cl); head(fitted.x)
resid.x <- x - fitted(cl)
## Equalities : 
cbind(cl[c("betweenss", "tot.withinss", "totss")], # the same two columns
      c(ss(fitted.x), ss(resid.x), ss(x)))
stopifnot(all.equal(cl$ totss, ss(x)),
          all.equal(cl$ tot.withinss, ss(resid.x)),
          ## these three are the same:
          all.equal(cl$ betweenss, ss(fitted.x)),
          all.equal(cl$ betweenss, cl$totss - cl$tot.withinss),
          ## and hence also
          all.equal(ss(x), ss(fitted.x) + ss(resid.x))
          )
kmeans(x,1)$withinss # trivial one-cluster, (its W.SS == ss(x))
## random starts do help here with too many clusters
## (and are often recommended anyway!):
## The ordering of the clusters may be platform-dependent.
## IGNORE_RDIFF_BEGIN
(cl <- kmeans(x, 5, nstart = 25))
## IGNORE_RDIFF_END
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8)