[R-sig-Geo] Natural Breaks Classification

Roger Bivand Roger.Bivand at nhh.no
Sat Feb 25 10:04:55 CET 2006


On Fri, 24 Feb 2006, David Bitner wrote:

> Most of my problem is that this is my first time using R, so I just
> don't know the functions (or even really the semantic differences) for
> dealing with vectors,arrays, and matrices.

They are all vectors, but arrays and matrices have dimension attributes.

> 
> Thank you very much for the code sample.  I have been looking around,
> is there a good basic reference online to find out basic function
> information so I can reference what things like order, match, unlist,
> tapply, etc. are doing?  I have yet to find just a good functional
> reference doc for R.

http://cran.r-project.org/doc/manuals/R-lang.html

For the functions themselves, see their help pages. Of these four, only 
order is often used interactively, the others are usually within 
functions. match() is a well-written table look-up, unlist() turns a list 
of objects into a flat vector, and tapply() applies a function (defined 
here on the fly to a "ragged" array - the list made by choosing values of 
a for each separate index of cols. So order is fixing the arbitrary 
ordering of the cluster centres for matching with the cluster membership 
to get out cols ordered from lowest class to highest rather than in the 
arbitrary order (cluster numbering) output by kmeans. This is sensible 
because kmeans doesn't know that a has only one column and a natural 
odering - if a was k-dimensional, there would be no natural cluster 
ordering. From that we retrieve the min and max (range) for each cluster 
by using tapply(), and flatten the list it returns.

Hope this helps - I think you'll need a bit more than two numbers out - 
preferably a string saying how you found the breaks.

Roger

> 
> Given that I am not plotting my data in R, I think that you are going
> further than what I actually need.  Given the small example data set,
> all that I really need is just the two numbers 4 and 11 (before I had
> said 3, but I realized the third is unnecessary). From these I plan to
> construct my classes as <4,>=4 and <11, >=11 when I create my
> classification in MapServer where I am doing the plotting.
> 
> David
> 
> 
> 
> On 2/24/06, Roger Bivand <Roger.Bivand at nhh.no> wrote:
> > On Fri, 24 Feb 2006, David Bitner wrote:
> >
> > > I was actually already using that paper as a resource, but I am still
> > > not quite understanding how to get just the breaks that I can use for
> > > classification in an external mapping application.
> > >
> > > Right now I am trying out some samples just to understand what's going on.
> > >
> > > Using this array:
> > > a<-c(11,1,3,3,4,4,4,4,4,4,5)
> > > This gives me the vector (3,1,1,1,2,2,2,2,2,2,2)
> > >
> > > ideally I want the output for this to be just (3,5,11)
> > >
> > > Taking this with the min/max of my data I can create the classes
> > > 1-3,4-5,6-11
> > >
> >
> > Next line on p. 15:
> >
> > > a<-c(11,1,3,3,4,4,4,4,4,4,5)
> > > av <- kmeans(a, 3)
> > > str(av)
> > List of 4
> >  $ cluster : int [1:11] 3 1 1 1 2 2 2 2 2 2 ...
> >  $ centers : num [1:3, 1]  2.33  4.14 11.00
> >   ..- attr(*, "dimnames")=List of 2
> >   .. ..$ : chr [1:3] "1" "2" "3"
> >   .. ..$ : NULL
> >  $ withinss: num [1:3] 2.667 0.857 0.000
> >  $ size    : int [1:3] 3 7 1
> >  - attr(*, "class")= chr "kmeans"
> > > cols <- match(av$cluster, order(av$centers))
> > > cols
> >  [1] 3 1 1 1 2 2 2 2 2 2 2
> > > stbrks <- matrix(unlist(tapply(a, factor(cols), range)), ncol = 2,
> > + byrow = TRUE)
> > > stbrks
> >      [,1] [,2]
> > [1,]    1    3
> > [2,]    4    5
> > [3,]   11   11
> >
> > which then needs unpacking:
> >
> > > apply(stbrks, 1, function(x) paste(x, collapse="-"))
> > [1] "1-3"   "4-5"   "11-11"
> >
> > but which isn't pretty, because there is a gap. As plot(ecdf(jitter(a)),
> > verticals=TRUE) shows, the gap is there in the data, so it is only a
> > matter of convention that class intervals are shown for values that are
> > not in the data, and my preference is to let the data do the talking (but
> > I'm in a minority!), and cartographic convention will probably mean
> > shifting the internal gaps to "look" smaller, and rounding the values to
> > make them easier to handle. For larger numbers of data items, it may be
> > necessary to get the breaks and re-assign the cols because of rounding.
> >
> > The stbrks matrix is just the range within the class for each class for
> > the cols resulting from the classification.
> >
> > With kmeans, note that it can get stuck, so I'd put it inside try() and
> > copy jittered values onto the end of the data vector if it fails - simply
> > because kmeans seeds its classes at random first, and can be unlucky.
> >
> > Please let the list know how you get on, I'm sure class intervals are an
> > interesting topic.
> >
> > Roger
> >
> > >
> > >
> > > On 2/24/06, Roger Bivand <Roger.Bivand at nhh.no> wrote:
> > > > On Fri, 24 Feb 2006, David Bitner wrote:
> > > >
> > > > > I am trying to create some type of a Natural Breaks Classification in
> > > > > PL/R  to classify data that I have in a PostgreSQL/PostGIS database.
> > > > > All that I really need so that I can pass information on to Mapserver
> > > > > to display this data is the class break values (ie an array [3,5,7,9]
> > > > > would mean show values 1-3 in blue, 4-5 in red, etc.).  I am
> > > > > completely new to R, but have a fair bit of experience with the other
> > > > > PL's in PostgreSQL, so the PL part shouldn't be too hard to figure
> > > > > out.
> > > > >
> > > > > I am thinking that kmeans should give me something close to what I
> > > > > want.  My problem is that I am not quite sure to massage the output to
> > > > > get my class breaks or what format to input the data.
> > > > >
> > > > > For arguments, kmeans takes a matrix, number of classes, and a max
> > > > > number of iterations --
> > > > > last two are easy, but how do I convert (do I need to convert) an
> > > > > array into a matrix (is matrix just R speak for an array?)  I will be
> > > > > starting with a one dimension array ie ([1,4,1,1,1,6,4,9,9]).
> > > > >
> > > > > The next issue is spitting the data out.  The docs tell me that I get
> > > > > the centers of the clusters where what I really want are the
> > > > > boundaries of the clusters, how could I get at the "break points" that
> > > > > I am after?
> > > >
> > > > I think you may find some of the code in:
> > > >
> > > > http://spatial.nhh.no/papers/aag04.pdf
> > > >
> > > > useful (though dated). kmeans() and - my preference - bclust() in the
> > > > e1071 package may fail when given too few values, but there are ways round
> > > > that. You'll see code (for example at the top of p. 15) on how to get the
> > > > class centres out - but they - as done there - still leave gaps between
> > > > classes as can be seen from the ECDF plot below. As you can see, kmeans()
> > > > manages with a vector OK.
> > > >
> > > > It would be nice to follow this up as a single function - input the data
> > > > vector and some preferences wrt. number of classes, and output as a number
> > > > of (list of) class intervals with some fitness criterion, something like
> > > > that?
> > > >
> > > > I guess your colour palette is on the PL side?
> > > >
> > > > Roger
> > > >
> > > > >
> > > > > Any help is appreciated,
> > > > > David
> > > > >
> > > > > _______________________________________________
> > > > > R-sig-Geo mailing list
> > > > > R-sig-Geo at stat.math.ethz.ch
> > > > > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> > > > >
> > > >
> > > > --
> > > > Roger Bivand
> > > > Economic Geography Section, Department of Economics, Norwegian School of
> > > > Economics and Business Administration, Helleveien 30, N-5045 Bergen,
> > > > Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
> > > > e-mail: Roger.Bivand at nhh.no
> > > >
> > > >
> > >
> >
> > --
> > Roger Bivand
> > Economic Geography Section, Department of Economics, Norwegian School of
> > Economics and Business Administration, Helleveien 30, N-5045 Bergen,
> > Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
> > e-mail: Roger.Bivand at nhh.no
> >
> >
> 

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no




More information about the R-sig-Geo mailing list