[R-sig-Geo] spatial clustering taking account of "value"

Wed Mar 10 07:50:25 CET 2010

Hi Sean,

Yes, 1.5 million points is a lot of data. What scale is it on? Is it for
all of the UK, or a single metropolitan area?

I ask this because you probably want to break the data into smaller
areas and then cluster each of those areas. You can do this using some
sort of administrative boundaries, or alternatively, you can construct a
relative neighbor graph, and then create sub-graphs of the larger graph
by removing edges whose length exceeds a threshold. You then
cluster each of the sub-graphs. You can construct the relative neighbor
graph using the relativeneigh() function of the spdep package
(although my guess is you will need a machine with a lot of memory).
Clustering with a spatial density-based algorithm is a good choice.
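
To make the graph-splitting step concrete, here is a minimal sketch in
plain Python (not the spdep implementation). It assumes a small set of
planar coordinates, and the function names relative_neighbor_edges and
split_by_threshold are made up for illustration. The brute-force
construction is O(n^3), so it only shows the idea and would not scale to
1.5 million points.

```python
import math
from itertools import combinations


def relative_neighbor_edges(points):
    """Brute-force relative neighbor graph.

    Keep edge (i, j) unless some third point k is strictly closer to
    both i and j than they are to each other.
    """
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    edges = []
    for i, j in combinations(range(n), 2):
        blocked = any(
            max(d[i][k], d[j][k]) < d[i][j]
            for k in range(n)
            if k != i and k != j
        )
        if not blocked:
            edges.append((i, j, d[i][j]))
    return edges


def split_by_threshold(n, edges, max_len):
    """Drop edges longer than max_len, then label connected components."""
    parent = list(range(n))

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i, j, length in edges:
        if length <= max_len:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Two well-separated groups of points (made-up coordinates).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
edges = relative_neighbor_edges(points)
labels = split_by_threshold(len(points), edges, max_len=2.0)
```

Points that share a label belong to the same sub-graph, and each
sub-graph can then be clustered on its own.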

I have used this two-step method before to good effect (with DBSCAN and
modified versions of DBSCAN in which the epsilon radius is a function of
the local population density), though only on data sets of 20,000
points, which is itself typically considered a lot of points.
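
For the "highest value" part of the question, one simple option is to
run a density-based clustering and then rank the resulting clusters by
their summed value. The sketch below is plain DBSCAN with a brute-force
neighbor search, written in Python. It is not the variant with a
density-dependent epsilon, and the names dbscan and
rank_clusters_by_value, the coordinates, and the values are all made up
for illustration.

```python
import math
from collections import defaultdict


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)

    def neighbours(i):
        # Brute-force range query; includes the point itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reclaimed as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point, keep expanding
                queue.extend(nb)
    return labels


def rank_clusters_by_value(labels, values):
    """Sum the values per cluster (ignoring noise), highest total first."""
    totals = defaultdict(float)
    for lab, v in zip(labels, values):
        if lab != -1:
            totals[lab] += v
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up (x, y) locations and sums insured.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
values = [1, 2, 3, 100, 200, 300, 999]
labels = dbscan(points, eps=1.5, min_pts=3)
ranked = rank_clusters_by_value(labels, values)
```

Note that the isolated high-value point at (50, 50) is treated as noise
and dropped, which matches ignoring points that are on their own but may
not be what you want for a single very large risk.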

If you are interested in what I do, you might want to get in touch
with me off list.

Dan

On Tue, 2010-03-09 at 21:51 -0800, Sean O'Riordain wrote:
> Good morning,
> 
> I'm afraid I don't even know *exactly* what I'm looking for - apart
> from some guidance please!
> 
> I have about 1.5 million (x,y,value) triples - for the most part these
> are independent from each other - building location and sum insured.
> I'm sure there are *lots* of clusters but I've no idea how many, and
> I'm really only interested in looking at the clusters of highest
> value.
> 
> I've already programmed a simple tagging of total value within 500
> metres of every location - though not every building is accurately
> tagged - some are only geocoded to UK postcode - so all buildings in a
> postcode have the same coordinates.
> 
> I'm looking to highlight "clusters" (definition unclear!) where there
> are a number of points "close together" (definition unclear!) and the
> sum of all the values in the "cluster" is "high".  I'm happy to ignore
> all "low" valued clusters or points which are of low value and all on
> their own.  There could be a maximum threshold distance (say 5km) or
> space between points beyond which it is definitely not part of a
> cluster.  The algorithm doesn't have to perfectly identify all
> clusters - I'm quite happy to start by looking at a small (say the
> top 10) set of highest valued "clusters".
> 
> I've looked at a variety of sources on the web - but it is my
> understanding that 1 million+ points is considered *very* big for most
> clustering algorithms.  I've only come across clustering by distance
> rather than sum of value and distance - I'm probably missing something
> or mis-interpreting what I'm seeing!  I think I'm looking for a
> modified form of density clustering...  Clearly I can't create a
> full-size distance matrix and perfection isn't expected ! :-)  A
> modified DBSCAN looks like it might be what I'm looking for?
> 
> Clearly an alternative to clustering is some sort of density algorithm
> that allows for value - but I can't quite get my head around how this
> might work.
> 
> Could someone point me in the right direction - what other keywords
> should I be looking out for?  what R packages are worth a look?
> 
> Thanks in advance,
> Sean O'Riordain
> Dublin,
> Ireland
> 
-- 
Dan Putler
Sauder School of Business
University of British Columbia