[R-sig-Geo] spatial clustering taking account of "value"

Carson Farmer carson.farmer at gmail.com
Wed Mar 10 11:06:49 CET 2010


I may be misinterpreting your desired output, but could you not also
perform LISA on this dataset?
For example, if you are interested in:

>> highlight[ing] "clusters" (definition unclear!) where there
>> are a number of points "close together" (definition unclear!) and the
>> sum of all the values in the "cluster" is "high".

Then the Getis-Ord Gi(*) statistic (not *truly* a LISA) would
likely do the trick for you.

> library(spdep)
> ?localG

From the help:
The local spatial statistic G is calculated for each zone based on the
spatial weights object used. The value returned is a Z-value, and may
be used as a diagnostic tool. High positive values indicate the
possibility of a local cluster of high values of the variable being
analysed; very low relative values indicate a similar cluster of low
values. The spatial weights object can be created using a distance
threshold (as you mentioned above), or, I imagine, based on
relativeneigh().
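For instance, something along these lines might work, assuming planar
coordinates in metres in a two-column matrix xy and the sums insured in
a vector value (both hypothetical names), and reusing your 500 m
tagging radius as the weights threshold:

    library(spdep)

    ## distance-based neighbours: all points within 500 m of each point
    nb <- dnearneigh(xy, d1 = 0, d2 = 500)

    ## binary weights; include.self() gives the Gi* (rather than Gi) variant
    lw <- nb2listw(include.self(nb), style = "B", zero.policy = TRUE)

    ## local G: large positive Z-values flag clusters of high value
    gi <- localG(value, lw, zero.policy = TRUE)

    ## points with the strongest "high-value cluster" signal
    head(order(gi, decreasing = TRUE), 10)

With 1.5 million points the dnearneigh() call may be slow, so it is
probably worth trying this on a regional subset first.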

Regards,

Carson


On Wed, Mar 10, 2010 at 9:32 AM, Sean O'Riordain <seanpor at acm.org> wrote:
> Hi Dan,
>
> Thanks for that.  Yes indeed, it is intended for the entire UK - but
> clearly there are large swathes of the country where the 'cluster
> value' would be below the nominal threshold of interest.  The trick
> (in my mind) is to identify the approximate areas of interest and
> areas of non-interest without prior knowledge of the clusters -
> though clearly central London is likely to be a candidate and the
> Highlands of Scotland (oil refineries excluded!) are not!
>
> I'll do some experiments later to look at the memory requirements of
> relativeneigh() ... this machine has 12 GB... we'll see! :-)
>
> cheers and thanks again,
> Seán
>
>
>
> On Wed, Mar 10, 2010 at 6:50 AM, Dan Putler <dan.putler at sauder.ubc.ca> wrote:
>> Hi Sean,
>>
>> Yes, 1.5 million points is a lot of data. What scale is it on? Is it for
>> all of the UK, or a single metropolitan area?
>>
>> I ask this because you probably want to break the data into smaller
>> areas and then cluster each of those areas. You can do this using some
>> sort of administrative boundaries, or alternatively, you can construct a
>> relative neighbor graph and then create sub-graphs of the larger graph
>> by removing edges whose length exceeds a threshold. You then
>> cluster each of the sub-graphs. You can construct the relative neighbor
>> graph using the relativeneigh() function of the spdep package
>> (although my guess is you will need a machine with a lot of memory).
>> Clustering with a spatial density-based algorithm is a good choice.
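A rough sketch of that pruning step in spdep, on toy coordinates, with
the 5 km cut-off taken from the illustrative maximum in Sean's original
message:

    library(spdep)

    set.seed(1)
    xy <- cbind(runif(1000, 0, 50000), runif(1000, 0, 50000))  # toy points

    ## relative neighbour graph -> symmetric neighbour list
    nb <- graph2nb(relativeneigh(xy), sym = TRUE)

    ## drop edges longer than 5 km, leaving disconnected sub-graphs
    d <- nbdists(nb, xy)
    for (i in seq_along(nb)) {
        keep <- d[[i]] <= 5000
        nb[[i]] <- if (any(keep)) nb[[i]][keep] else 0L
    }

    ## connected components of the pruned graph = areas to cluster separately
    comp <- n.comp.nb(nb)
    table(comp$comp.id)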
>>
>> I have actually used this two-step method before to good effect (using
>> DBSCAN and modified versions of DBSCAN where the epsilon radius is a
>> function of the local population density), although with data sets of
>> 20,000 points (which is typically considered to be a lot of points).
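For reference, plain DBSCAN is available in the fpc package; the
density-dependent epsilon variant Dan describes would need a custom
implementation. A minimal call on the points of one sub-area (xy.sub,
eps and MinPts all illustrative) might look like:

    library(fpc)

    ## eps is the neighbourhood radius, MinPts the density threshold
    fit <- dbscan(xy.sub, eps = 500, MinPts = 5)
    fit$cluster  # 0 = noise, 1..k = cluster labels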
>>
>> If you are interested in what I do, you might want to get in touch
>> with me off list.
>>
>> Dan
>>
>> On Tue, 2010-03-09 at 21:51 -0800, Sean O'Riordain wrote:
>>> Good morning,
>>>
>>> I'm afraid I don't even know *exactly* what I'm looking for - apart
>>> from some guidance please!
>>>
>>> I have about 1.5 million (x, y, value) triples - building location
>>> and sum insured - and for the most part these are independent of
>>> each other.  I'm sure there are *lots* of clusters, but I've no idea
>>> how many, and I'm really only interested in looking at the clusters
>>> of highest value.
>>>
>>> I've already programmed a simple tagging of the total value within
>>> 500 metres of every location - though not every building is
>>> accurately geocoded: some are only geocoded to UK postcode, so all
>>> buildings in a postcode share the same coordinates.
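For what it is worth, that tagging step can be written compactly with
spdep's distance-based neighbours (again assuming hypothetical xy and
value objects):

    library(spdep)

    nb <- dnearneigh(xy, d1 = 0, d2 = 500)

    ## total value within 500 m of each point, own value included;
    ## spdep codes a point with no neighbours as the single entry 0L
    tag <- value + sapply(seq_along(nb), function(i) {
        j <- nb[[i]]
        if (j[1] == 0L) 0 else sum(value[j])
    })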
>>>
>>> I'm looking to highlight "clusters" (definition unclear!) where there
>>> are a number of points "close together" (definition unclear!) and the
>>> sum of all the values in the "cluster" is "high".  I'm happy to ignore
>>> all "low"-valued clusters, or points which are of low value and all on
>>> their own.  There could be a maximum threshold distance (say 5 km)
>>> between points beyond which a point is definitely not part of a
>>> cluster.  The algorithm doesn't have to perfectly identify all
>>> clusters - I'm quite happy to start by looking at a small set (say
>>> the top 10) of the highest-valued "clusters".
>>>
>>> I've looked at a variety of sources on the web - but it is my
>>> understanding that 1 million+ points is considered *very* big for most
>>> clustering algorithms.  I've only come across clustering by distance
>>> rather than by sum of value and distance - I'm probably missing
>>> something or misinterpreting what I'm seeing!  I think I'm looking
>>> for a modified form of density clustering...  Clearly I can't create
>>> a full-size distance matrix, and perfection isn't expected! :-)  A
>>> modified DBSCAN looks like it might be what I'm looking for?
>>>
>>> Clearly an alternative to clustering is some sort of density algorithm
>>> that allows for value - but I can't quite get my head around how this
>>> might work.
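One concrete version of that idea is a value-weighted kernel density
surface, for example with spatstat (the bandwidth sigma and the
rectangular window are purely illustrative):

    library(spatstat)

    ## point pattern on a rectangular window (coordinates in metres)
    pp <- ppp(xy[, 1], xy[, 2],
              window = owin(range(xy[, 1]), range(xy[, 2])))

    ## kernel density weighted by sum insured: peaks mark high-value areas
    dens <- density(pp, sigma = 500, weights = value)
    plot(dens)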
>>>
>>> Could someone point me in the right direction - what other keywords
>>> should I be looking out for?  What R packages are worth a look?
>>>
>>> Thanks in advance,
>>> Sean O'Riordain
>>> Dublin,
>>> Ireland
>>>
>> --
>> Dan Putler
>> Sauder School of Business
>> University of British Columbia
>>
>>
>



-- 
Carson J. Q. Farmer
ISSP Doctoral Fellow
National Centre for Geocomputation
National University of Ireland, Maynooth,
http://www.carsonfarmer.com/


