[R-sig-Geo] spatial clustering taking account of "value"

Wed Mar 10 10:32:34 CET 2010

Hi Dan,

Thanks for that.  Yes indeed it is intended for the entire UK - but
clearly there are large swathes of the country where the 'cluster
value' would be below the nominal threshold of interest - the trick
(in my mind) is to identify the approximate areas of interest and
areas of non-interest without prior knowledge of the clusters - though
clearly central London is likely to be a candidate and the highlands
of Scotland (oil refineries excluded!) are not!
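
One cheap way to find approximate areas of interest without prior knowledge of the clusters is to sum value over a coarse grid and rank the cells. The sketch below is an illustration in Python rather than R. The function name `top_cells`, the 500 m cell size and the toy data are assumptions made for the example, and the coordinates are assumed to be projected (in metres). It is only a screening step, not a clustering method.

```python
from collections import defaultdict

def top_cells(points, cell_size, n_top=10):
    """Sum value per square grid cell and return the n_top richest cells.

    points: iterable of (x, y, value); x and y are assumed to be in metres.
    """
    totals = defaultdict(float)
    for x, y, value in points:
        # Floor division assigns each point to exactly one cell.
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell] += value
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n_top]

# Toy data (made up): one valuable group, one cheap pair, one isolated point.
toy = [
    (100.0, 100.0, 5.0), (150.0, 120.0, 7.0), (300.0, 400.0, 3.0),
    (5200.0, 5100.0, 2.0), (5300.0, 5400.0, 2.5),
    (9000.0, 9000.0, 1.0),
]
cells = top_cells(toy, cell_size=500.0, n_top=2)
```

This needs a single pass over the points and memory proportional to the number of occupied cells, so 1.5 million points should not be a problem. A cluster that straddles a cell boundary gets split between cells, which is why the result is only a rough guide to where to look.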

I'll do some experiments later to look at the memory requirements of
relativeneigh() ... this machine has 12 GB... we'll see! :-)
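
For a sense of scale, the arithmetic below shows why a dense distance matrix is out of reach for 1.5 million points. It assumes 8-byte double-precision distances and says nothing about what relativeneigh() itself needs, which is still to be measured.

```python
n = 1_500_000            # number of points mentioned in the thread
bytes_per_double = 8     # assumption: one double-precision value per distance

full_matrix = n * n * bytes_per_double                 # every ordered pair
upper_triangle = n * (n - 1) // 2 * bytes_per_double   # each pair stored once

print(f"full matrix:    {full_matrix / 1e12:.1f} TB")
print(f"upper triangle: {upper_triangle / 1e12:.1f} TB")
```

Even the triangular form comes to roughly 9 TB, several hundred times the 12 GB available, so any workable approach has to restrict itself to local neighbourhoods.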

cheers and thanks again,
Seán

On Wed, Mar 10, 2010 at 6:50 AM, Dan Putler <dan.putler at sauder.ubc.ca> wrote:
> Hi Sean,
>
> Yes, 1.5 million points is a lot of data. What scale is it on? Is it for
> all of the UK, or a single metropolitan area?
>
> I ask this because you probably want to break the data into smaller
> areas and then cluster each of those areas. You can do this using some
> sort of administrative boundaries, or alternatively, you can construct a
> relative neighbor graph, and then create sub-graphs of the larger graph
> by removing edges from the graph that exceed a threshold. You then
> cluster each of the sub-graphs. You can construct the relative neighbor
> graph using the relativeneigh() function of the spdep package
> (although, my guess is you will need a machine with a lot of memory).
> Clustering with a spatial density based algorithm is a good choice.
>
> I have actually used this two-step method before to good effect (using
> DBSCAN and modified versions of DBSCAN where the epsilon radius is a
> function of the local population density), although only with data sets
> of 20,000 points (which is typically considered to be a lot of points).
>
> If you are interested in what I do, you might want to get in touch
> with me off list.
>
> Dan
>
> On Tue, 2010-03-09 at 21:51 -0800, Sean O'Riordain wrote:
>> Good morning,
>>
>> I'm afraid I don't even know *exactly* what I'm looking for - apart
>> from some guidance please!
>>
>> I have about 1.5 million (x,y,value) triples - for the most part these
>> are independent from each other - building location and sum insured.
>> I'm sure there are *lots* of clusters but I've no idea how many, and
>> I'm really only interested in looking at the clusters of highest
>> value.
>>
>> I've already programmed a simple tagging of total value within 500
>> metres of every location - though not every building is accurately
>> tagged - some are only geocoded to UK postcode - so all buildings in a
>> postcode have the same coordinates.
>>
>> I'm looking to highlight "clusters" (definition unclear!) where there
>> are a number of points "close together" (definition unclear!) and the
>> sum of all the values in the "cluster" is "high".  I'm happy to ignore
>> all "low" valued clusters or points which are of low value and all on
>> their own.  There could be a maximum threshold distance (say 5km) or
>> space between points beyond which it is definitely not part of a
>> cluster.  The algorithm doesn't have to perfectly identify all
>> clusters - I'm quite happy to start by looking at a small (say the
>> top 10) set of highest valued "clusters".
>>
>> I've looked at a variety of sources on the web - but it is my
>> understanding that 1 million+ points is considered *very* big for most
>> clustering algorithms.  I've only come across clustering by distance
>> rather than sum of value and distance - I'm probably missing something
>> or mis-interpreting what I'm seeing!  I think I'm looking for a
>> modified form of density clustering...  Clearly I can't create a
>> full-size distance matrix and perfection isn't expected ! :-)  A
>> modified DBSCAN looks like it might be what I'm looking for?
>>
>> Clearly an alternative to clustering is some sort of density algorithm
>> that allows for value - but I can't quite get my head around how this
>> might work.
>>
>> Could someone point me in the right direction - what other keywords
>> should I be looking out for?  what R packages are worth a look?
>>
>> Thanks in advance,
>> Sean O'Riordain
>> Dublin,
>> Ireland
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> --
> Dan Putler
> Sauder School of Business
> University of British Columbia
>
>
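The "modified DBSCAN" idea raised in the original question can be sketched as follows. This is a minimal illustration in Python rather than R, and it is not Dan's method: his variant adapts the epsilon radius to local population density and works on sub-graphs of a relative neighbour graph, whereas this sketch uses one fixed radius and replaces DBSCAN's minimum point count with a minimum summed value. The names `value_clusters`, `eps` and `min_value` and the toy data are assumptions made for the example.

```python
from collections import defaultdict
from math import hypot

def value_clusters(points, eps, min_value):
    """Group 'core' points into clusters and total their values.

    A point is core when the summed value of all points within eps of it
    (itself included) is at least min_value. Core points within eps of each
    other are linked, and each connected group is one cluster. Non-core
    points are ignored, a simplification of DBSCAN's border-point handling.
    """
    # Bucket points into eps-sized cells so that only the 3x3 block of
    # surrounding cells is searched, never all pairs.
    grid = defaultdict(list)
    for i, (x, y, _) in enumerate(points):
        grid[(int(x // eps), int(y // eps))].append(i)

    def neighbours(i):
        x, y, _ = points[i]
        cx, cy = int(x // eps), int(y // eps)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    if hypot(points[j][0] - x, points[j][1] - y) <= eps:
                        yield j

    core = {i for i in range(len(points))
            if sum(points[j][2] for j in neighbours(i)) >= min_value}

    clusters, seen = [], set()
    for start in sorted(core):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:                       # depth-first walk over core points
            i = stack.pop()
            members.append(i)
            for j in neighbours(i):
                if j in core and j not in seen:
                    seen.add(j)
                    stack.append(j)
        total = sum(points[i][2] for i in members)
        clusters.append((total, sorted(members)))
    return sorted(clusters, reverse=True)  # highest total value first

# Toy data (made up): a valuable group, a cheap pair, an isolated cheap point.
toy = [
    (0, 0, 10), (100, 0, 20), (0, 100, 30),
    (5000, 5000, 1), (5100, 5000, 1),
    (20000, 20000, 5),
]
result = value_clusters(toy, eps=500, min_value=50)
```

With these toy values only the first three points form a cluster, and taking `result[:10]` would give the ten highest-valued clusters. Buildings geocoded to the same postcode coordinates simply add their values at one location, so they need no special handling here.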