[R-sig-Geo] Natural Breaks Classification

Roger Bivand Roger.Bivand at nhh.no
Fri Feb 24 22:49:51 CET 2006


On Fri, 24 Feb 2006, David Bitner wrote:

> I was actually already using that paper as a resource, but I am still
> not quite understanding how to get just the breaks that I can use for
> classification in an external mapping application.
> 
> Right now I am trying out some samples just to understand what's going on.
> 
> Using this array:
> a<-c(11,1,3,3,4,4,4,4,4,4,5)
> This gives me the vector (3,1,1,1,2,2,2,2,2,2,2)
> 
> ideally I want the output for this to be just (3,5,11)
> 
> Taking this with the min/max of my data I can create the classes
> 1-3,4-5,6-11
> 

Next line on p. 15:

> a<-c(11,1,3,3,4,4,4,4,4,4,5)
> av <- kmeans(a, 3)
> str(av)
List of 4
 $ cluster : int [1:11] 3 1 1 1 2 2 2 2 2 2 ...
 $ centers : num [1:3, 1]  2.33  4.14 11.00
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "1" "2" "3"
  .. ..$ : NULL
 $ withinss: num [1:3] 2.667 0.857 0.000
 $ size    : int [1:3] 3 7 1
 - attr(*, "class")= chr "kmeans"
> cols <- match(av$cluster, order(av$centers))
> cols
 [1] 3 1 1 1 2 2 2 2 2 2 2
> stbrks <- matrix(unlist(tapply(a, factor(cols), range)), ncol = 2, 
+ byrow = TRUE)
> stbrks
     [,1] [,2]
[1,]    1    3
[2,]    4    5
[3,]   11   11

which then needs unpacking:

> apply(stbrks, 1, function(x) paste(x, collapse="-"))
[1] "1-3"   "4-5"   "11-11"

but which isn't pretty, because there is a gap. As plot(ecdf(jitter(a)),
verticals=TRUE) shows, the gap is there in the data, so it is only a
matter of convention that class intervals are shown for values that are
not in the data, and my preference is to let the data do the talking (but
I'm in a minority!), and cartographic convention will probably mean
shifting the internal gaps to "look" smaller, and rounding the values to 
make them easier to handle. For larger numbers of data items, it may be 
necessary to get the breaks and re-assign the cols because of rounding.

The stbrks matrix is just the range within the class for each class for 
the cols resulting from the classification.

With kmeans, note that it can get stuck, so I'd put it inside try() and
copy jittered values onto the end of the data vector if it fails - simply
because kmeans seeds its classes at random first, and can be unlucky.

Please let the list know how you get on, I'm sure class intervals are an 
interesting topic.

Roger

> 
> 
> On 2/24/06, Roger Bivand <Roger.Bivand at nhh.no> wrote:
> > On Fri, 24 Feb 2006, David Bitner wrote:
> >
> > > I am trying to create some type of a Natural Breaks Classification in
> > > PL/R  to classify data that I have in a PostgreSQL/PostGIS database.
> > > All that I really need so that I can pass information on to Mapserver
> > > to display this data is the class break values (ie an array [3,5,7,9]
> > > would mean show values 1-3 in blue, 4-5 in red, etc.).  I am
> > > completely new to R, but have a fair bit of experience with the other
> > > PL's in PostgreSQL, so the PL part shouldn't be too hard to figure
> > > out.
> > >
> > > I am thinking that kmeans should give me something close to what I
> > > want.  My problem is that I am not quite sure to massage the output to
> > > get my class breaks or what format to input the data.
> > >
> > > For arguments, kmeans takes a matrix, number of classes, and a max
> > > number of iterations --
> > > last two are easy, but how do I convert (do I need to convert) an
> > > array into a matrix (is matrix just R speak for an array?)  I will be
> > > starting with a one dimension array ie ([1,4,1,1,1,6,4,9,9]).
> > >
> > > The next issue is spitting the data out.  The docs tell me that I get
> > > the centers of the clusters where what I really want are the
> > > boundaries of the clusters, how could I get at the "break points" that
> > > I am after?
> >
> > I think you may find some of the code in:
> >
> > http://spatial.nhh.no/papers/aag04.pdf
> >
> > useful (though dated). kmeans() and - my preference - bclust() in the
> > e1071 package may fail when given too few values, but there are ways round
> > that. You'll see code (for example at the top of p. 15) on how to get the
> > class centres out - but they - as done there - still leave gaps between
> > classes as can be seen from the ECDF plot below. As you can see, kmeans()
> > manages with a vector OK.
> >
> > It would be nice to follow this up as a single function - input the data
> > vector and some preferences wrt. number of classes, and output as a number
> > of (list of) class intervals with some fitness criterion, something like
> > that?
> >
> > I guess your colour palette is on the PL side?
> >
> > Roger
> >
> > >
> > > Any help is appreciated,
> > > David
> > >
> > > _______________________________________________
> > > R-sig-Geo mailing list
> > > R-sig-Geo at stat.math.ethz.ch
> > > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> > >
> >
> > --
> > Roger Bivand
> > Economic Geography Section, Department of Economics, Norwegian School of
> > Economics and Business Administration, Helleveien 30, N-5045 Bergen,
> > Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
> > e-mail: Roger.Bivand at nhh.no
> >
> >
> 

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no




More information about the R-sig-Geo mailing list